Re: Further analysis of the GC issue

Richard Hirsch Thu, 26 Nov 2009 01:10:33 -0800

@Markus It would be interesting to remove the Textile parser and do
the tests again.


This would confirm whether it is the culprit or not. If I remember
correctly, it was just a change in one line of code.

Just found the change:
(http://svn.apache.org/viewvc/incubator/esme/trunk/server/src/main/scala/org/apache/esme/model/Message.scala?r1=804817&r2=819509&diff_format=h)
You could change the code back to the older version and try again.

D.

On Thu, Nov 26, 2009 at 10:03 AM, Markus Kohler <[email protected]> wrote:
> Hi Michael,
> Good to see you here!
>
> "Memory Analyzer"? That's me ;-)
>
> The 23 Gbyte are not "retained" at one point in time; they are the sum of
> all temporarily allocated objects. Most of that memory (or all of it, since
> there doesn't seem to be an obvious memory leak) is gone within a millisecond.
> I'm confident that this value can be decreased to 90Mbyte and can be further
> improved down to a few MByte (or even less). We already know that the
> 90Mbyte are mostly caused by an inefficient Textile parser.
>
> I also used the Memory Analyzer to look at how much memory is retained, i.e.
> still in use/referenced after the user interaction has finished. The
> report is here:
> http://cwiki.apache.org/confluence/display/ESME/Performance+test+-+2009-11-22
> Although there's room for improvement, potentially caused by the same bug that
> turned 90Mbyte into 23Gbyte, I don't see any major issues yet with regard
> to memory usage.
>
> This is also related to the stateless versus stateful discussion. ATM the
> amount of state needed for one user is already very low (a few hundred
> kByte), at least compared to what I'm used to with Enterprise Applications.
> It is at least an order of magnitude lower, which can only partially be
> explained by ESME being less complex than the typical Enterprise app.
> So far I don't see any major roadblock from the design perspective that
> would stop us from scaling very well.
>
> In my experience, it's quite normal that as soon as someone with a little
> bit of experience in performance takes a closer look at a piece of software,
> a few dramatic improvements can be made. That makes working as a performance
> analysis expert so gratifying. You suggest a few improvements, which have a
> dramatic impact, and then you walk away before it gets too complicated ;-)
> No, that's not my intention here :-)
>
>
> Markus
>
> "The best way to predict the future is to invent it" -- Alan Kay
>
>
> On Thu, Nov 26, 2009 at 6:04 AM, Bechauf, Michael
> <[email protected]> wrote:
>
>> David,
>>
>> well, "dead wrong" is a strong expression; hopefully I'm still breathing. I
>> don't want to judge without having looked at the code myself, but I have no
>> idea how a massive multi-user system could possibly be designed with state
>> where per-user information is kept in memory for a certain time. I mean, 23
>> GB allocated - that's tough for an SAP transaction server that is not
>> multithreaded and where the memory management is highly optimized based on
>> shared memory that the work processes can attach to, or rolled out to a file
>> if unused for a while. It is, however, deadly for a VM that was never
>> designed for such memory consumption and where a GC run can halt the server.
>>
>> Anyway, I'll study this a bit more, particularly the Scala architecture. I
>> heard many good things about Scala, but in the end it's all translated to
>> things a VM can understand, and I hope Scala does a good enough job managing
>> this load in a transparent way.
>>
>> -Michael
>>
>>
>> ----- Original Message -----
>> From: David Pollak <[email protected]>
>> To: [email protected] <[email protected]>
>> Sent: Wed Nov 25 23:00:20 2009
>> Subject: Re: Further analysis of the GC issue
>>
>> On Wed, Nov 25, 2009 at 7:16 PM, Bechauf, Michael
>> <[email protected]> wrote:
>>
>> > Wasn't this exactly the kind of stuff that the Eclipse Memory Analyzer -
>> > donated by SAP - was supposed to fix? A heap of that size for a still
>> > moderate number of 300 users is crazy, so either there is stuff like
>> > circular references that hog memory, or the design model is fundamentally
>> > flawed. I don't understand why ESME needs "sessions". How can a scalable
>> > server be created if each user will allocate memory until some timeout?
>> > In a world of stateless browser-based UIs that's not going to work.
>> >
>>
>> You're actually dead wrong about this.  "Stateless" is not... it's just
>> pushing state and cache someplace else (the RDBMS, memcached, etc.).
>> "Stateless" will lead to radical performance problems.  "Stateless" merely
>> moves the caching decisions into code you don't control.  I dealt with this
>> issue first-hand while helping a popular micro-blogging site migrate from a
>> "stateless" backend to a Scala-based one.  I'm dealing with this issue
>> first-hand helping another popular site that's experiencing exponential
>> growth migrate away from "push everything back to the RDBMS and hope for
>> the best."
>>
>> My original design for ESME is stateful.  My original design for ESME is
>> based on lessons learned in this very space and was oriented to have things
>> intelligently cached so that the caching is not based on RDBMS indexes.
>> I'm not sure what happened to cause the particular issues, but it seems like
>> folks are loading messages from the RDBMS rather than asking the message
>> cache for them.
>>
>>
>> >
>> > Time for me to look at that code ...
>> >
>> > -Michael
>> >
>> >
>> > ----- Original Message -----
>> > From: Markus Kohler <[email protected]>
>> > To: [email protected] <[email protected]>
>> > Sent: Wed Nov 25 12:14:58 2009
>> > Subject: Further analysis of the GC issue
>> >
>> > Hi all,
>> > the Garbage Collector issue I was talking about is reproducible.
>> > I've uploaded an annotated GC graph to
>> >
>> > http://picasaweb.google.com/lh/photo/wB-RRtb0wIVfpxJkTJPNuw?authkey=Gv1sRgCOve7LThpfvXsQE&feat=directlink
>> >
>> > I think the "LOGON" phase where I log on all 300 users looks OK (given
>> > that Textile formatting is probably involved), but the phase where just
>> > one user sends one message is certainly not looking good.
>> >
>> > I took the profiler and the result is a bit shocking. For that one
>> > message, 881,000,000 objects weighing 23.2 Gbyte were allocated (and
>> > reclaimed afterwards). My former record was 2Gbyte ;-)
>> >
>> > Fortunately I have a theory about what happens, without having looked
>> > into the code yet, so take it with a grain of salt. It seems that the
>> > public timeline for all users is re-rendered, because 99% of the
>> > allocations come from org.apache.esme.comet.PublicTimeline.render().
>> > I guess all the actors for all the users are sitting there, not knowing
>> > that the user has closed the browser, because the user session has not
>> > yet expired.
>> >
>> > I wonder how we get around this with a real "push" model. If the browser
>> > asked for updates, this rendering could be done lazily. Or can we "ping"
>> > the browser and check whether it responds?
>> > On the other hand, it should also not be necessary to re-render the
>> > message again and again, because the result will be the same.
>> >
>> > I will send Richard some attachments. Not sure whether you will need
>> > them; they look very similar to the ones we already have.
>> >
>> > BTW, we should definitely check the use
>> > of scala.xml.XML$.loadString(java.lang.String).
>> > It's creating a new parser each time, which is a bit costly because it
>> > allocates a new buffer each time and also hits the disk when searching
>> > for the name of the Java class.
>> >
>> > Greetings,
>> > Markus
>> >
>> >
>> >
>> > "The best way to predict the future is to invent it" -- Alan Kay
>> >
>>
>>
>>
>> --
>> Lift, the simply functional web framework http://liftweb.net
>> Beginning Scala http://www.apress.com/book/view/1430219890
>> Follow me: http://twitter.com/dpp
>> Surf the harmonics
>>
>
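
A note on the point in the original message that re-rendering the same
message again and again should not be necessary, because the result will be
the same: one possible way to avoid it is to cache the rendered output per
message id. The following is only a minimal sketch under assumptions. Msg,
RenderCache and the render function are hypothetical names made up for
illustration; they are not ESME's actual classes, and the real fix may look
quite different.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a message; NOT the real ESME model class.
final case class Msg(id: Long, body: String)

// Caches the rendered form of each message by id, so the expensive step
// (for example, Textile parsing) is normally performed once per message.
// If two threads race on the same id, render may run more than once,
// but only the first stored result is kept and returned.
final class RenderCache(render: Msg => String) {
  private val cache = new ConcurrentHashMap[Long, String]()

  def rendered(m: Msg): String = {
    val hit = cache.get(m.id)
    if (hit != null) hit
    else {
      val fresh = render(m)
      // putIfAbsent returns null if we stored our value, else the existing one.
      val prev = cache.putIfAbsent(m.id, fresh)
      if (prev != null) prev else fresh
    }
  }
}

object RenderCacheDemo {
  def main(args: Array[String]): Unit = {
    val calls = new AtomicInteger(0)
    // Assumed renderer: counts invocations and wraps the body in a tag.
    val cache = new RenderCache(m => { calls.incrementAndGet(); "<p>" + m.body + "</p>" })
    val msg = Msg(1L, "hello")
    // Simulate 300 timelines all displaying the same message.
    (1 to 300).foreach(_ => cache.rendered(msg))
    println(calls.get()) // prints 1
  }
}
```

This sketch never evicts entries, so a real cache would also need a size
bound or expiry policy to keep the retained heap small.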
