Hi, Oh yes it's the rendering holding the Strings. will send you an html file.
Markus "The best way to predict the future is to invent it" -- Alan Kay On Mon, Nov 23, 2009 at 10:57 PM, Ethan Jewett <[email protected]> wrote: > Actually, to amend this: In looking at the code carefully, I don't see > where the public timeline Comet actors actually duplicate the messages > in the queue. They appear to only store a List of message IDs. > However, it looks like it does possibly pre-render the full list of > messages? I think David or Vassil would have to comment. > > On the other hand, I think the API actor does duplicate the messages, > so I'll have to look at fixing that up if it is really the case. > > I'd love to learn MAT. Just a matter of finding the time! > > Ethan > > On Mon, Nov 23, 2009 at 2:57 PM, Markus Kohler <[email protected]> > wrote: > > Hi all, > > Again moving this thread to the mailing list. > > > > See my comments below. > > > > Regards, > > Markus > > > > > > "The best way to predict the future is to invent it" -- Alan Kay > > > > > > On Sun, Nov 22, 2009 at 5:39 PM, Ethan Jewett <[email protected]> > wrote: > > > >> Hi Markus, > >> > >> This is intriguing, though it's actually not that surprising. I think > >> David and Vassil might have explanations for a lot of this, but maybe > >> I can shed a little light based on how I understand ESME should work: > >> > >> On Sat, Nov 21, 2009 at 5:11 PM, Markus Kohler <[email protected] > > > >> wrote: > >> > >> > This means that "Europe/Berlin" appears 11729 times in memory at the > >> point > >> > in time the heap dump was taken. If those Strings could be > >> removed 750.656 > >> > or more bytes could be reclaimed. This copies seems to come > >> > from net.liftweb.http.RenderOut btw. > >> > For shallow versus retained size/heap > >> > check > >> > http://kohlerm.blogspot.com/2009/02/how-to-really-measure-memory-usage-of.html > >> . > >> > If you look further below in the table you can see that it seems that > >> each > >> > message is there 323 times, which is equal to the number of users. > >> > Note that I manually created only 2 messages "Test123" and "Test1234", > >> the > >> > others are just status updates. > >> > This means that with n users potentially producing n times more > messages > >> > than a single user, that will not only need n times more memory, but n > >> times > >> > n e.g. O(n^2). > >> > >> I think that theoretically this isn't quite right. A lot of the > >> duplicate messages you are seeing are in the public timeline queues > >> that are created when a user is logged into the website (or an API > >> session). Messages do duplicate in these queues so that the exact set > >> of messages the user needs is available for download. Once the user > >> downloads them to the web interface or via the API delta mechanism, > >> the queue should be cleared out. > > > > > > Ok, I think I understand what you mean, but shouldn't we then see less > > duplicates for the last user because this user should have got all the > > updates. > > And even then i think the message text is exactly the same there is > > not personalization involved and therefore it's not necessary to store > the > > duplicates of the Strings. If it's needed that the messages are copied of > > each user one could still share the String for equal messages. The > messages > > are immutable so there shouldn't be a problem with this approach, I > think. > > > > > >> (Would be excellent to test this.) So > >> worst case, you have O(n*m) memory usage where n is the number of > >> created messages and m is the number of *logged in* users. > >> > >> That's still pretty terrible, but it's not out of hand. I think this > >> points us in the direction of getting rid of the public timeline > >> altogether, since it is most of the problem. Without the public > >> timeline, if every user is following 10 other users, then you only get > >> O(10*m) where m is the number of users logged in. > >> > >> > > I agree that the public timeline is a major issue at least for larger > number > > of users. I think I already suggested to remove it. But I think one major > > issue with scaling twitter like services, is that there might be burst of > > messages after certain poplar events and that the nr. of followers > probably > > follows the power law (http://en.wikipedia.org/wiki/Power_law) This > means > > there are a few users with an very high follower rate which could cause > > message storms and essentialy cause a lot of messages to be copied. > > > > > >> It's possible that this is not the behavior you're seeing, but if it's > >> not, then I think this is a bug. > >> > >> > > I think the easiest way to analyze this is I to show you how to use MAT > and > > then you can browser the the object graph. > > > > > >> > This will kill us as the number of users goes up. I understand that in > >> Scala > >> > a lot of objects can be immutable, but I still think it doesn't make > >> sense > >> > to hold copies of those Strings. > >> > >> It does't make sense in general, but I think it does make sense when > >> you are queuing up messages that you know users are going to request. > >> Making the queues immutable and, more importantly, self-contained > >> results in a situation in which a single user's request for new > >> messages is extremely cheap. All the work has already been done. The > >> inability to split the work to generate the message queue from the > >> actual request for the queue was what killed Twitter and what David (I > >> understand) had a large hand in fixing. That experience informs a lot > >> of the work done on ESME. I agree on reducing memory footprint, but > >> there are significant advantages to having this type of duplication > >> (though keeping the size of the duplication as small as possible would > >> be excellent). > >> > >> > > Immutable is only a relative I think. For example Strings in Java are > > immutable, but that doesn't mean that if you do an operation on String > that > > "changes" it that then the whole object graph of the String is copied. In > > fact Strings point to char[] and those char[] can be shared (people are > > often surprised), depending on what operations you call. > > > > My idea is that the same could be done with the messages. > > > > > > > >> > There are long chains of scala.$colon$colon everywhere and this class > >> seems > >> > to cause quite some overhead ( linked lists I guess) > >> > Check histogram.zip. The shallow size is already 10% of the heap. > >> > The dominator tree > >> > ( > http://kohlerm.blogspot.com/2009/02/memory-leaks-are-easy-to-find.html > >> ) grouped > >> > by class (dominator_tree.zip) shows that 63% of the memory is spend in > >> the > >> > the org.apache.esme.comet.PublicTimeline. Haven't looked into the > >> details, > >> > but I guess that overhead is caused by those String duplicates. > >> > >> I'm not surprised by this. As mentioned above, the public timeline is > >> something like size O(n*m) in memory while the non-public timelines > >> are more like O(10*m) in total, for some value of 10. (Ok, I should be > >> using 1, but my non-computer-science background is showing :-) > >> > > > > > > So Linked lists are used for Queues? > > If so, is it "only" because they are easy to use in Scala? I found that > in > > most cases queues can be implemented with Array like structures more > > efficiently. You just maintain a start end end "pointer" into the Array. > > Check > > > http://www.java-tips.org/java-se-tips/java.lang/array-based-queue-implementation-in-java.html > > to > > get the idea (not sure whether this impl is any good). > > > > Arrays are just more compact and cache friendly. > > > > > >> > >> I think it would be very educational to take a look at what happens > >> when you have fewer users logged in, when there are more messages, and > >> ideally with a modified version of ESME that doesn't provide the > >> public timeline (possibly based on api2 calls?). > >> > >> > > Sure. My plan would be to extend the script, such that users would also > > follow a few other users and also post messages. > > Ideally at some point the followers number should follower the power law > I > > guess. > > > > But so far there are enough problems. For example posting immediately > after > > all users where logged on is very slow. > > > > > > > > Regards, > > Markus > > > > > >> Hopefully that's a bit helpful! > >> > > > > Yes, very helpful indeed! > > > >> > >> Ethan > >> > > >
