Hi all, Again moving this thread to the mailing list. See my comments below.
Regards, Markus "The best way to predict the future is to invent it" -- Alan Kay On Sun, Nov 22, 2009 at 5:39 PM, Ethan Jewett <[email protected]> wrote: > Hi Markus, > > This is intriguing, though it's actually not that surprising. I think > David and Vassil might have explanations for a lot of this, but maybe > I can shed a little light based on how I understand ESME should work: > > On Sat, Nov 21, 2009 at 5:11 PM, Markus Kohler <[email protected]> > wrote: > > > This means that "Europe/Berlin" appears 11729 times in memory at the > point > > in time the heap dump was taken. If those Strings could be > removed 750.656 > > or more bytes could be reclaimed. This copies seems to come > > from net.liftweb.http.RenderOut btw. > > For shallow versus retained size/heap > > check > http://kohlerm.blogspot.com/2009/02/how-to-really-measure-memory-usage-of.html > . > > If you look further below in the table you can see that it seems that > each > > message is there 323 times, which is equal to the number of users. > > Note that I manually created only 2 messages "Test123" and "Test1234", > the > > others are just status updates. > > This means that with n users potentially producing n times more messages > > than a single user, that will not only need n times more memory, but n > times > > n e.g. O(n^2). > > I think that theoretically this isn't quite right. A lot of the > duplicate messages you are seeing are in the public timeline queues > that are created when a user is logged into the website (or an API > session). Messages do duplicate in these queues so that the exact set > of messages the user needs is available for download. Once the user > downloads them to the web interface or via the API delta mechanism, > the queue should be cleared out. Ok, I think I understand what you mean, but shouldn't we then see less duplicates for the last user because this user should have got all the updates. And even then i think the message text is exactly the same there is not personalization involved and therefore it's not necessary to store the duplicates of the Strings. If it's needed that the messages are copied of each user one could still share the String for equal messages. The messages are immutable so there shouldn't be a problem with this approach, I think. > (Would be excellent to test this.) So > worst case, you have O(n*m) memory usage where n is the number of > created messages and m is the number of *logged in* users. > > That's still pretty terrible, but it's not out of hand. I think this > points us in the direction of getting rid of the public timeline > altogether, since it is most of the problem. Without the public > timeline, if every user is following 10 other users, then you only get > O(10*m) where m is the number of users logged in. > > I agree that the public timeline is a major issue at least for larger number of users. I think I already suggested to remove it. But I think one major issue with scaling twitter like services, is that there might be burst of messages after certain poplar events and that the nr. of followers probably follows the power law (http://en.wikipedia.org/wiki/Power_law) This means there are a few users with an very high follower rate which could cause message storms and essentialy cause a lot of messages to be copied. > It's possible that this is not the behavior you're seeing, but if it's > not, then I think this is a bug. > > I think the easiest way to analyze this is I to show you how to use MAT and then you can browser the the object graph. > > This will kill us as the number of users goes up. I understand that in > Scala > > a lot of objects can be immutable, but I still think it doesn't make > sense > > to hold copies of those Strings. > > It does't make sense in general, but I think it does make sense when > you are queuing up messages that you know users are going to request. > Making the queues immutable and, more importantly, self-contained > results in a situation in which a single user's request for new > messages is extremely cheap. All the work has already been done. The > inability to split the work to generate the message queue from the > actual request for the queue was what killed Twitter and what David (I > understand) had a large hand in fixing. That experience informs a lot > of the work done on ESME. I agree on reducing memory footprint, but > there are significant advantages to having this type of duplication > (though keeping the size of the duplication as small as possible would > be excellent). > > Immutable is only a relative I think. For example Strings in Java are immutable, but that doesn't mean that if you do an operation on String that "changes" it that then the whole object graph of the String is copied. In fact Strings point to char[] and those char[] can be shared (people are often surprised), depending on what operations you call. My idea is that the same could be done with the messages. > > There are long chains of scala.$colon$colon everywhere and this class > seems > > to cause quite some overhead ( linked lists I guess) > > Check histogram.zip. The shallow size is already 10% of the heap. > > The dominator tree > > (http://kohlerm.blogspot.com/2009/02/memory-leaks-are-easy-to-find.html > ) grouped > > by class (dominator_tree.zip) shows that 63% of the memory is spend in > the > > the org.apache.esme.comet.PublicTimeline. Haven't looked into the > details, > > but I guess that overhead is caused by those String duplicates. > > I'm not surprised by this. As mentioned above, the public timeline is > something like size O(n*m) in memory while the non-public timelines > are more like O(10*m) in total, for some value of 10. (Ok, I should be > using 1, but my non-computer-science background is showing :-) > So Linked lists are used for Queues? If so, is it "only" because they are easy to use in Scala? I found that in most cases queues can be implemented with Array like structures more efficiently. You just maintain a start end end "pointer" into the Array. Check http://www.java-tips.org/java-se-tips/java.lang/array-based-queue-implementation-in-java.html to get the idea (not sure whether this impl is any good). Arrays are just more compact and cache friendly. > > I think it would be very educational to take a look at what happens > when you have fewer users logged in, when there are more messages, and > ideally with a modified version of ESME that doesn't provide the > public timeline (possibly based on api2 calls?). > > Sure. My plan would be to extend the script, such that users would also follow a few other users and also post messages. Ideally at some point the followers number should follower the power law I guess. But so far there are enough problems. For example posting immediately after all users where logged on is very slow. Regards, Markus > Hopefully that's a bit helpful! > Yes, very helpful indeed! > > Ethan >
