Re: Further analysis of the GC issue

Richard Hirsch Sat, 28 Nov 2009 02:18:26 -0800

Wiki page is back again


On Sat, Nov 28, 2009 at 10:19 AM, Richard Hirsch <[email protected]> wrote:
> For some reason the wiki page about the performance test on 11-25 was
> lost; I'll have to create it once again.
>
> On Fri, Nov 27, 2009 at 5:47 AM, Richard Hirsch <[email protected]> wrote:
>> Moved this whole thread to the wiki:
>> http://cwiki.apache.org/confluence/display/ESME/Performance+test+2009-11-25
>>
>> D.
>>
>> On Thu, Nov 26, 2009 at 2:22 PM, Markus Kohler <[email protected]> wrote:
>>> Hi Michael,
>>> No problem :-)
>>>
>>>
>>>
>>> Regards,
>>> Markus
>>>
>>> "The best way to predict the future is to invent it" -- Alan Kay
>>>
>>>
>>> On Thu, Nov 26, 2009 at 2:12 PM, Bechauf, Michael
>>> <[email protected]> wrote:
>>>
>>>> Thanks Markus. That certainly sounds much better. I was already
>>>> confused yesterday, because 23 GByte of memory would be a little
>>>> difficult to allocate when not even the operating system can handle
>>>> such a size. I should have asked right away. Blame it on jetlag.
>>>>
>>>> -Michael
>>>>
>>>> -----Original Message-----
>>>> From: Markus Kohler [mailto:[email protected]]
>>>> Sent: Thursday, Nov 26, 2009 1:04 AM
>>>> To: [email protected]
>>>> Subject: Re: Further analysis of the GC issue
>>>>
>>>> Hi Michael,
>>>> Good to see you here!
>>>>
>>>> "Memory Analyzer"? that's me ;-)
>>>>
>>>> The 23 GByte are not "retained" at one point in time; they are the
>>>> sum of all temporarily allocated objects. Most of that memory (or all
>>>> of it, since there doesn't seem to be an obvious memory leak) is gone
>>>> within a millisecond.
>>>> I'm confident that this value can be decreased to 90 MByte and can be
>>>> further improved down to a few MByte (or even less). We already know
>>>> that the 90 MByte are mostly caused by an inefficient textile parser.
>>>>
>>>> I also used the Memory Analyzer to look at how much memory is retained,
>>>> i.e. still in use/referenced after the user interaction has finished.
>>>> The report is here:
>>>> http://cwiki.apache.org/confluence/display/ESME/Performance+test+-+2009-11-22
>>>> There's also room for improvement there, potentially caused by the
>>>> same bug that turned 90 MByte into 23 GByte, but I don't see any major
>>>> issues yet with regard to memory usage.
>>>>
>>>> This is also related to the stateless versus stateful discussion. ATM
>>>> the amount of state needed for one user is already very low (a few
>>>> hundred kByte), at least compared to what I'm used to with Enterprise
>>>> Applications. It is at least an order of magnitude lower, which can
>>>> only partially be explained by ESME being less complex than the typical
>>>> Enterprise app.
>>>> So far I don't see any major roadblock from the design perspective that
>>>> would stop us from scaling very well.
>>>>
>>>> In my experience, it's quite normal that as soon as someone with a
>>>> little bit of experience in performance takes a closer look at a piece
>>>> of software, a few dramatic improvements can be made. That makes
>>>> working as a performance analysis expert so gratifying. You suggest a
>>>> few improvements, which have a dramatic impact, and then you walk away
>>>> before it gets too complicated ;-)
>>>> No, that's not my intention here :-)
>>>>
>>>>
>>>> Markus
>>>>
>>>> "The best way to predict the future is to invent it" -- Alan Kay
>>>>
>>>>
>>>> On Thu, Nov 26, 2009 at 6:04 AM, Bechauf, Michael
>>>> <[email protected]> wrote:
>>>>
>>>> > David,
>>>> >
>>>> > well, "dead wrong" is a strong expression; hopefully I'm still
>>>> > breathing. I don't want to judge without having looked at the code
>>>> > myself, but I have no idea how a massive multi-user system could
>>>> > possibly be designed with state, where per-user information is kept
>>>> > in memory for a certain time. I mean, 23 GB allocated - that's tough
>>>> > for an SAP transaction server that is not multithreaded and where
>>>> > the memory management is highly optimized based on shared memory that
>>>> > the work processes can attach to, or rolled out to a file if unused
>>>> > for a while. It is, however, deadly for a VM that was never designed
>>>> > for such memory consumption and where a GC run can halt the server.
>>>> >
>>>> > Anyway, I'll study this a bit more, particularly the Scala
>>>> > architecture. I heard many good things about Scala, but in the end
>>>> > it's all translated to things a VM can understand, and I hope Scala
>>>> > does a good enough job managing this load in a transparent way.
>>>> >
>>>> > -Michael
>>>> >
>>>> >
>>>> > ----- Original Message -----
>>>> > From: David Pollak <[email protected]>
>>>> > To: [email protected] <[email protected]>
>>>> > Sent: Wed Nov 25 23:00:20 2009
>>>> > Subject: Re: Further analysis of the GC issue
>>>> >
>>>> > On Wed, Nov 25, 2009 at 7:16 PM, Bechauf, Michael
>>>> > <[email protected]> wrote:
>>>> >
>>>> > > Wasn't this exactly the kind of stuff that the Eclipse Memory
>>>> > > Analyzer - donated by SAP - was supposed to fix? A heap of that size
>>>> > > for a still moderate number of 300 users is crazy, so either there
>>>> > > is stuff like circular references that hog memory, or the design
>>>> > > model is fundamentally flawed. I don't understand why ESME needs
>>>> > > "sessions". How can a scalable server be created if each user will
>>>> > > allocate memory until some timeout? In a world of stateless
>>>> > > browser-based UIs that's not going to work.
>>>> > >
>>>> >
>>>> > You're actually dead wrong about this.  "Stateless" is not... it's
>>>> > just pushing state and cache someplace else (the RDBMS, memcached,
>>>> > etc.).  "Stateless" will lead to radical performance problems.
>>>> > "Stateless" merely moves the caching decisions into code you don't
>>>> > control.  I dealt with this issue first-hand while helping a popular
>>>> > micro-blogging site migrate from a "stateless" to a Scala-based
>>>> > backend.  I'm dealing with this issue first-hand helping another
>>>> > popular site that's experiencing exponential growth migrate away from
>>>> > "push everything back to the RDBMS and hope for the best."
>>>> >
>>>> > My original design for ESME is stateful.  My original design for ESME
>>>> > is based on lessons learned in this very space and was oriented to
>>>> > have things intelligently cached so that the caching is not based on
>>>> > RDBMS indexes.  I'm not sure what happened to cause the particular
>>>> > issues, but it seems like folks are loading messages from the RDBMS
>>>> > rather than asking the message cache for them.
>>>> >
>>>> >
>>>> > >
>>>> > > Time for me to look at that code ...
>>>> > >
>>>> > > -Michael
>>>> > >
>>>> > >
>>>> > > ----- Original Message -----
>>>> > > From: Markus Kohler <[email protected]>
>>>> > > To: [email protected] <[email protected]>
>>>> > > Sent: Wed Nov 25 12:14:58 2009
>>>> > > Subject: Further analysis of the GC issue
>>>> > >
>>>> > > Hi all,
>>>> > > the Garbage Collector issue I was talking about is reproducible.
>>>> > > I've uploaded an annotated GC graph to
>>>> > >
>>>> > > http://picasaweb.google.com/lh/photo/wB-RRtb0wIVfpxJkTJPNuw?authkey=Gv1sRgCOve7LThpfvXsQE&feat=directlink
>>>> > >
>>>> > > I think the "LOGON" phase where I log on all the 300 users looks OK
>>>> > > (given that textile formatting is probably involved), but the phase
>>>> > > where just one user sends one message is certainly not looking good.
>>>> > >
>>>> > > I took the profiler and the result is a bit shocking. For that one
>>>> > > message, 881,000,000 objects weighing 23.2 GByte were allocated (and
>>>> > > reclaimed afterwards). My former record was 2 GByte ;-)
>>>> > >
>>>> > > Fortunately I have a theory about what happens, without having looked
>>>> > > into the code yet, so take it with a grain of salt. It seems that the
>>>> > > public timeline for all users is re-rendered, because 99% of the
>>>> > > allocations come from org.apache.esme.comet.PublicTimeline.render().
>>>> > > I guess all the actors for all the users are sitting there, not
>>>> > > knowing that the user has closed the browser, because the user
>>>> > > session has not yet expired.
>>>> > >
>>>> > > I wonder how we get around this with a real "push" model. If the
>>>> > > browser asked for updates, this rendering could be done lazily. Or
>>>> > > can we "ping" the browser and check whether it responds?
>>>> > > On the other hand, it should also not be necessary to re-render the
>>>> > > message again and again, because the result will be the same.
>>>> > >
>>>> > > I will send Richard some attachments. Not sure whether you will need
>>>> > > them; they look very similar to the ones we already have.
>>>> > >
>>>> > > BTW, we should definitely check the use of
>>>> > > scala.xml.XML$.loadString(java.lang.String).
>>>> > > It's creating a new Parser each time, which is a bit costly because
>>>> > > it allocates a new Buffer each time and also hits the disk when
>>>> > > searching for the name of the Java class.
>>>> > >
>>>> > > Greetings,
>>>> > > Markus
>>>> > >
>>>> > >
>>>> > >
>>>> > > "The best way to predict the future is to invent it" -- Alan Kay
>>>> > >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Lift, the simply functional web framework http://liftweb.net
>>>> > Beginning Scala http://www.apress.com/book/view/1430219890
>>>> > Follow me: http://twitter.com/dpp
>>>> > Surf the harmonics
>>>> >
>>>>
>>>
>>
>
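
The parser point in the original message (a new parser per
scala.xml.XML$.loadString call) can be sketched at the JAXP level in Java.
This is a minimal sketch, not ESME or Lift code: it assumes the per-call
cost comes from SAXParserFactory.newInstance() and newSAXParser(), and the
names ReusedParser and countElements are hypothetical. The factory lookup
is done once, and each thread keeps one parser, because SAXParser is not
guaranteed to be thread-safe.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reuses one SAX parser per thread instead of
// creating a factory and a parser for every string that is parsed.
class ReusedParser {
    // The provider lookup behind newInstance() happens only once here.
    private static final SAXParserFactory FACTORY = SAXParserFactory.newInstance();

    // One parser per thread; SAXParser must not be shared across threads.
    private static final ThreadLocal<SAXParser> PARSER = ThreadLocal.withInitial(() -> {
        try {
            // The factory is not guaranteed to be thread-safe either.
            synchronized (FACTORY) {
                return FACTORY.newSAXParser();
            }
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("cannot create SAX parser", e);
        }
    });

    // Parses the given XML string and returns the number of elements in it.
    static int countElements(String xml) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                count[0]++;
            }
        };
        try {
            // The same parser instance is used again once a parse has finished.
            PARSER.get().parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException | IOException e) {
            throw new IllegalArgumentException("invalid XML", e);
        }
        return count[0];
    }
}
```

Whether this helps in ESME would have to be confirmed with the profiler;
the sketch only shows where the repeated factory lookup and parser
allocation could be avoided.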

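The remark that a message should not be re-rendered again and again,
because the result will be the same, could look roughly like the following
Java sketch. It is an illustration under assumptions, not the ESME design:
RenderCache, the message id key and the renderer function are all
hypothetical, and the cache is unbounded, so real code would need eviction.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical cache: renders each message once and hands the same
// result to every user timeline that asks for it.
class RenderCache {
    private final ConcurrentHashMap<Long, String> rendered = new ConcurrentHashMap<>();
    private final AtomicInteger renderCalls = new AtomicInteger();
    private final Function<String, String> renderer;

    RenderCache(Function<String, String> renderer) {
        this.renderer = renderer;
    }

    // Returns the cached rendering for messageId, rendering only on first use.
    String render(long messageId, String source) {
        return rendered.computeIfAbsent(messageId, id -> {
            renderCalls.incrementAndGet();
            return renderer.apply(source);
        });
    }

    // Number of times the (expensive) renderer was actually invoked.
    int renderCalls() {
        return renderCalls.get();
    }
}
```

With 300 logged-on users asking for the same message, the renderer in this
sketch runs once instead of 300 times; the remaining per-user work is a
map lookup.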