Wow David, that's a fantastic explanation. Thank you for taking the time to write it. Should help a lot in the future (especially when I have to remind myself why I'm messing around with Actors and Futures in the api2 endpoint :-).
Can we put this or an edited version on the wiki?

Ethan

On Mon, Nov 30, 2009 at 3:40 PM, David Pollak <[email protected]> wrote:

> Folks,

> Over the last 6 or so months, we've had a bunch of discussions on the list about statefulness, REST, and ESME's overall design. I want to walk through the design choices I've made for ESME and why "stateless" and other such designs fail and are dead wrong for a social networking system.

> There is no such thing as stateless. Every web site has state. The state may change frequently or it may change infrequently. A web site made up of static files has its state based on those static pages. When those pages are changed, the state changes. State is kept somewhere for all web sites.

> Some web sites will present a different state depending on who is accessing the site. This can be as simple as serving different pages depending on the IP address or the language preference expressed in the HTTP headers. This is sessionful. The content is calculated based on the request. It may be more sophisticated in terms of authenticating the HTTP request and presenting content based on the authentication.

> A session for sessionful content may be short-lived (the length of the request) or it may be longer-lived (typically this is done with an initial authentication phase resulting in a shared secret [JSESSIONID] that is presented as an authentication proxy in subsequent requests).

> But no matter the authentication mechanism or the session lifespan, there must exist a mechanism for translating the HTTP request into the content presented for the session.

> Far and away the most common way of persisting and calculating state is a relational database (RDBMS). RDBMSs are awesome creatures. They sit on top of some excellent and well understood mathematics: set theory. They have well known and well understood concurrency mechanisms: transactions. They have been designed, built, tested, and optimized over the last generation. RDBMSs offer a simple set of commands (SELECT, DELETE, INSERT, UPDATE) as well as a generally human-understandable set of semantics: people understand that RDBMSs are sets of things and that there are simple ways to ask about these sets. RDBMSs have grown up alongside ERP systems and have evolved to meet the needs of those systems.

> However, there are well known things that RDBMSs don't do well. These include tree structures (yeah, Oracle and others have extensions for tree walks, but nothing is part of the SQL spec, and the performance of these extensions is not always the same as other models: a tree walk in an RDBMS costs O(log n) for each node, where a tree walk in an OO system costs O(1)). Social networks/social graphs are another place where RDBMSs do not excel.

> Let's dive down into this.

> A naive implementation of a social messaging site runs something like these tables:

> - Users(id, name, password)
> - Friends(owner, friend)
> - Messages(id, poster, content, date)

> So, if we wanted to calculate the timeline for a given user at a given instant, the query would look like:

> SELECT messages.*
>   FROM messages, friends
>  WHERE friends.owner = current_user
>    AND messages.poster = friends.friend
>  ORDER BY messages.date DESC
>  LIMIT 20

> Assuming we've got indexes on friends.owner, messages.poster, and messages.date, the query still results in O(n log n) where n is the aggregate number of messages posted. This is non-trivial, and if you follow someone who has posted 20,000 messages (yeah Anne, I'm talkin' to you), the n log n cost becomes significant.

> Basically, each time a client asks for the latest timeline, you've got an O(n log n) operation to determine state. This doesn't scale.
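To make the shape of that cost concrete, here is a rough Scala sketch of the same computation over hypothetical in-memory stand-ins for the three tables (the names are illustrative, not ESME's): every message from every followed poster has to be collected and sorted before the newest 20 can be returned.

    // Hypothetical in-memory stand-ins for the naive schema.
    case class User(id: Long, name: String, password: String)
    case class Friend(owner: Long, friend: Long)
    case class Message(id: Long, poster: Long, content: String, date: Long)

    // The same work the naive SQL query implies: gather every message from every
    // followed poster, sort the whole lot by date, keep the newest 20.
    def naiveTimeline(currentUser: Long,
                      friends: List[Friend],
                      messages: List[Message]): List[Message] = {
      val followed = friends.filter(_.owner == currentUser).map(_.friend).toSet
      messages
        .filter(m => followed.contains(m.poster)) // scan of the aggregate messages
        .sortBy(m => -m.date)                     // the O(n log n) term
        .take(20)
    }

The sort over the aggregate message list is where the n log n comes from, and one prolific poster inflates it for every one of their followers.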
> The first obvious response to the issue is caching (capturing the state beyond the duration of a short-lived session). I'm going to skip caching for a moment and do a more sophisticated implementation of timelines so we can get better performance.

> Let's create a mailbox table. Each time someone publishes a message, a reference to that message will be put in a Mailbox(owner, message, date) table, and we'll create an index on the table: (owner, date DESC).

> This changes the query to:

> SELECT messages.*
>   FROM messages, mailbox
>  WHERE mailbox.owner = current_user
>    AND messages.id = mailbox.message
>  ORDER BY mailbox.date DESC
>  LIMIT 20

> Depending on your RDBMS, you will wind up with an O(log n) operation. You find the newest mailbox entry for the user (O(log n)) and do an index walk until you've found 20 entries. (I'm putting aside the fact that looking up the 20 messages themselves costs another O(log n) per message, because 20 is a small number and the messages will likely be in the database's cache... this operation is going to be fast.)

> I'm going to sidetrack for a moment. I had the pleasure of talking over a few beers at a baseball game with one of the senior engineers at Facebook. We were talking about Facebook's scaling success. His comment was that it was successful but very expensive. If there were more than 3% cache misses from MySQL queries, the system would back up. If they got more than 2% cache misses from the memcached layer in front of their MySQL servers, the system would back up. So, basically, Facebook has 195% of their data in RAM.

> The net is that O(log n) is only going to work if you've got your entire index in the cache of your RDBMS. Even a dozen disk reads is going to turn a 10ms query into a 250ms query, and if you've got 1,000 users asking for a status update, you'll wind up with disk thrashing and ultimately you will not be able to satisfy all of those requests.

> Let's make our discussion more concrete. I'm assuming that an ESME instance will support 25,000 users. On average, a user will follow 100 people (100x fan-out of messages). Users will post one message every 30 minutes (48 messages a day). The day lasts 10 hours (this is a reasonable approximation for peakiness... basically, you're compressing 48 message sends into a 10-hour period). There are 300 days in a year. These numbers are averages, and there will be some folks who are above average in terms of fan-out (the CEO will have a 25,000x fan-out) and some folks who are above average in number of messages per day (yeah Anne, I'm lookin' at you... you too Dick).

> So, that means that each year there will be 36,000M (36B) mailbox entries. If each entry costs us 16 bytes of RAM for index purposes, we're at 576B bytes (576GB) of index. There's no way that amount of index will fit in RAM. Even if the average messages/day drops to 1, you're still looking at over 10GB of index. Alternatively, you could purge messages after 3 weeks or limit timelines to a certain number of messages. That's not unreasonable, but it's also adding a constraint to the system to deal with the limitations of the RDBMS. There are other alternatives.
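As a back-of-envelope check of those figures in Scala (every input below is one of the stated assumptions, not a measurement):

    object MailboxSizing extends App {
      val users         = 25000L // assumed instance size
      val fanOut        = 100L   // average number of mailboxes each message lands in
      val msgsPerDay    = 48L    // one message every 30 minutes
      val daysPerYear   = 300L
      val bytesPerEntry = 16L    // assumed index cost per Mailbox entry

      val entriesPerYear = users * msgsPerDay * fanOut * daysPerYear
      val indexBytes     = entriesPerYear * bytesPerEntry

      println(entriesPerYear) // 36000000000, i.e. the 36B mailbox entries above
      println(indexBytes)     // 576000000000 bytes, i.e. roughly 576GB of index
    }

Rerunning the same arithmetic at one message per day instead of 48 still gives about 12GB of index, which is where the "over 10GB" figure comes from.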
> Let me talk memcached for a minute. In my opinion, memcached means that you have a failed design. Memcached means abandoning all the awesome things that you get with an RDBMS: a mathematical model, a concurrency/transactional model, durability guarantees, etc.

> But we could move our state from the calculate-on-demand model of the RDBMS to a calculate-once-and-cache model using memcached. This means that you only take the nasty hits if the cache is not valid. Putting aside the cost of cache invalidation (I haven't covered the costs of updates in this discussion because there's no need to go there... the implementation failures can be demonstrated with just reads), if you have a simple cache invalidation scheme, most of the cache entries will not survive for 15 minutes (I can go through the math, but I'm going to leave this one to the reader). You risk cache stampedes (more than one process rebuilding the same cache entry). Basically, the naive memcached implementation buys you a little bit of head room over the naive (non-mailbox) approach. In order to get more than a 5x or so improvement (something that will serve a few thousand rather than a few hundred users), you need to manipulate the cache entries themselves, inserting/deleting individual messages.

> The above paragraph in fact leads us in the direction of a better answer.

> But first, let me state that I have proven that an RDBMS cannot be the sole locus of state for a social messaging site that services more than a few hundred users. Period. We must move state somewhere else and manage the cached state manually rather than with queries and indexes. Second, I have not discussed short-lived vs. long-lived sessions yet. I will get to that, but first, let's walk through a design that gives us a concurrency model as well as the performance we want.

> Imagine a model where you interact with a User via a limited set of (asynchronous) messages:

> - add/remove friend
> - add message to timeline
> - post message (the user has created a message and it needs to be processed)
> - get current timeline (with offsets and number of entries)

> These are the basic messages needed to implement a social messaging site. If we guarantee that a User will only process one message at a time, we have a concurrency model. It's simple, and simple is good. We have not defined how or where Users store their state (it could be on a filesystem, in an RDBMS, in a NoSQL store, who knows). But we can say that adding a message is an O(1) operation (prepending to the head of a singly linked list). Each User can have a caching policy (and that caching policy could be dynamic, based on the access characteristics of the User). The sender of a message doesn't block on the processing of the message (although the get-current-timeline message will have an asynchronous response that the sender will likely block on).

> We have changed our abstraction from one where all data (tables and indexes) are created equal to one where certain data structures are more prominent (User and Message) than others (mailbox, friends).

> We have lost something: transactions. In this model, if I add Dick as a friend, I am not guaranteed that I will receive Dick's next update... it may take time for the messages to propagate to Dick's User, and his Message may be sent before the "add friend" message gets to him. In the case of a financial transaction, this would be fatal. In the case of social networking, this is a perfectly reasonable trade-off.
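Below is a minimal sketch of that per-User message protocol, written against the old scala.actors library; all names are hypothetical, and how a User persists or caches its state is deliberately left open, as in the text. The react loop processes one message at a time, which is the entire concurrency model, and posting fans the message out asynchronously to followers.

    import scala.actors.Actor
    import scala.actors.Actor._

    case class Msg(poster: String, text: String, date: Long)

    // The asynchronous message set listed above (illustrative names).
    case class AddFriend(follower: UserActor)    // "send me your future messages"
    case class RemoveFriend(follower: UserActor)
    case class AddToTimeline(m: Msg)
    case class PostMessage(text: String)
    case class GetTimeline(offset: Int, count: Int)

    class UserActor(val name: String) extends Actor {
      private var followers: Set[UserActor] = Set()
      private var timeline: List[Msg] = Nil   // newest first

      def act(): Unit = {
        loop {
          react {                             // one message at a time
            case AddFriend(f)    => followers += f
            case RemoveFriend(f) => followers -= f
            case AddToTimeline(m) =>
              timeline = m :: timeline        // O(1) prepend to a singly linked list
            case PostMessage(text) =>
              val m = Msg(name, text, System.currentTimeMillis)
              timeline = m :: timeline
              followers.foreach(_ ! AddToTimeline(m)) // asynchronous fan-out
            case GetTimeline(offset, count) =>
              reply(timeline.drop(offset).take(count)) // reply sent to the asker
          }
        }
      }
    }

    object Demo extends App {
      val anne = new UserActor("anne"); anne.start()
      val dick = new UserActor("dick"); dick.start()
      anne ! AddFriend(dick)            // Dick adds Anne as a friend; the message goes to Anne's User
      anne ! PostMessage("hello world")
      // The one request a caller will usually block on:
      println((dick !? GetTimeline(0, 20)).asInstanceOf[List[Msg]])
    }

Note that the Demo's GetTimeline may or may not see Anne's post yet, since the fan-out is asynchronous; that is exactly the transactions-for-throughput trade-off described above.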
> So far, we have not talked about long-lived sessions and how they are valuable in such a model... and in particular in ESME.

> If we add one more message to our User, some of the reasons for long-lived sessions should become obvious: update me on timeline change. If you can register with the User for changes to the timeline, it means that we don't have to keep asking "are we there yet?" When a state change happens, it's instantly propagated out to the listeners. The alternative is for the listeners to ask "are we there yet?" over and over. The cost of asking "are we there yet?" is non-trivial, as anyone who has traveled with 5-year-olds can attest. Additionally, sometimes when one is having a conversation, it's nice to get an immediate response rather than waiting out some polling period. And with a listener model, the User does not need to store the date of each message (to answer "give me new messages since xxx"), which cuts cache storage costs by 50% (a big number across 25,000 users).

> So, having a long-lived session has some performance benefits over a short-lived session with polling, but this is only part of the story.

> One of the ways that RDBMSs get performance (and the way products like Oracle distinguish themselves from the likes of MySQL) is the ability to cache optimized query plans, cache the right data, and invalidate the right caches at the right time. The same requirements are going to come up in ESME.

> When I designed ESME, I changed the model from a Skittr model (1M users on a single box) to a more enterprise-friendly model. The key difference is that I added the "actions" feature, where each User gets to see every message processed in the system, analyze that message for content/context, and perform certain actions based on that analysis. Things like "add all messages containing 'catfood' to my timeline", or "forward all messages containing 'ESME' to my followers", or "make an HTTP post of all messages from my boss to a paging service", or "block 50% of the messages from Joe Blabbermouth". Actions are cool, but they are costly. Every message must be compared to every action definition in the system. This is expensive. If each user has an average of 10 actions, each message sent will have to be compared against 250,000 actions, and if we have a peak of 5 messages per hour per person, that's 31B comparisons per hour at peak time, or roughly 9M action comparisons per second. That's load.

> During peak load, we will need to prioritize which Users are processing messages/actions so that the system retains responsiveness and can drain the load. Put another way, knowing which Users have associated long-lived sessions allows us to prioritize the message processing for those Users. We allow more threads to drain the message queues for those Users while providing fewer threads for session-less Users. Yeah, we could prioritize on other heuristics, but long-lived session is dead simple and will cost us 5K bytes per logged-in user. Not a huge cost, and lots of benefit.

> So, the fact that long polling over an existing long-lived session is more efficient than repeated polling over short-lived sessions, together with the upcoming need for message prioritization, indicates that long-lived sessions are the right design choice.
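As a sketch of what the extra "update me on timeline change" message could look like on top of the actor sketch above (reusing its Msg, AddToTimeline, and GetTimeline; again, the names are hypothetical): a long-lived session registers itself once and has every new entry pushed to it, instead of polling and instead of the User keeping per-message dates around just to answer "give me new messages since xxx".

    import scala.actors.Actor
    import scala.actors.Actor._

    case class AddTimelineListener(listener: Actor)     // a long-lived session registers once
    case class RemoveTimelineListener(listener: Actor)  // e.g. when the session goes away
    case class TimelineUpdate(m: Msg)                   // pushed to listeners, never polled for

    // The timeline-holding part of a User, now push-capable.
    class PushTimeline extends Actor {
      private var timeline: List[Msg] = Nil
      private var listeners: Set[Actor] = Set()

      def act(): Unit = {
        loop {
          react {
            case AddTimelineListener(l)    => listeners += l
            case RemoveTimelineListener(l) => listeners -= l
            case AddToTimeline(m) =>
              timeline = m :: timeline
              listeners.foreach(_ ! TimelineUpdate(m))  // instant propagation, no "are we there yet?"
            case GetTimeline(offset, count) =>
              reply(timeline.drop(offset).take(count))  // still available for sessionless clients
          }
        }
      }
    }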
> Also, I hope that the above discussion makes it clear why I am insistent on message-oriented APIs rather than document/REST-oriented APIs. ESME's design is not traditional, and there are fewer tools helping us get the implementation right. On the other hand, ESME simply cannot be implemented on top of a relational/REST model. Let's keep our design consistent from the APIs back.

> Thanks,

> David

> --
> Lift, the simply functional web framework http://liftweb.net
> Beginning Scala http://www.apress.com/book/view/1430219890
> Follow me: http://twitter.com/dpp
> Surf the harmonics
