Come and get 'em: http://michael.susens-schurter.com/blog/2009/03/11/tokyo-cabinet-pytyrant-talk/
Someone asked me about data integrity last night, and I told a long story about TCP backoff algorithm issues. However, I forgot the punchline (aka the solution). Here's a better explanation.

Let's say we have packets PA, PB, and PC:

  PA - sent at 10:00:00 am from application node NA
  PB - sent at 10:00:01 am from application node NB
  PC - sent at 10:01:00 am from application node NC

Unfortunately, Mr. Sysadmin was doing a massive rsync while those packets were trying to make their way from the application server to the database (Tokyo Tyrant) server. "Luckily" the C in TCP stands for Control[1], so instead of losing data, the database server's operating system tells senders to back off for a second and try again later.

Now, if only one connection were being used between the application nodes and the database server, the operating system would ensure all TCP packets are processed in the order they were sent, regardless of the order in which they were received[2]. Unfortunately we have 3 nodes, and therefore 3 separate TCP connections. No spiffy guaranteed ordering for us.

So here's the order the database server receives the packets after telling our nodes to back off, because the rsync backup is saturating its NIC:

  PB, PC, PA

All 3 have the same key (say, a user's session key), so PA's data ends up being the data written last. When you read that key again, you'd expect to get PC's value, but instead you get PA's. Hilarity ensues. And by hilarity, I mean user data is seemingly randomly lost and users see very strange behavior in their browsers.

The solution: a Lua extension that automatically timestamps each key when it is written. This takes cooperation from the client side as well: the client writes a timestamp as the first X digits of the *value* for every key it puts (sends to Tokyo Tyrant). The Lua extension reads this timestamp and saves it in a field named "timestamp.$key" (where $key is the key being saved).
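To make the scheme concrete, here's a minimal Python sketch of the idea. The real guard runs as a Lua extension inside Tokyo Tyrant; here a plain dict stands in for the database, and the helper names (encode_value, guarded_put) and the fixed 17-digit prefix width are hypothetical illustrations, not PyTyrant's actual API:

```python
import time

TS_DIGITS = 17  # assumed fixed-width prefix, e.g. microseconds since the epoch

def encode_value(value, ts=None):
    """Client side: prepend a fixed-width timestamp to the value before putting it."""
    if ts is None:
        ts = int(time.time() * 1_000_000)
    return str(ts).zfill(TS_DIGITS) + value

def guarded_put(db, key, encoded):
    """Server side (a Lua extension in the real setup): refuse writes whose
    timestamp is older than the one already recorded for this key."""
    incoming_ts = int(encoded[:TS_DIGITS])
    ts_key = "timestamp.%s" % key
    saved = db.get(ts_key)
    if saved is not None and int(saved) > incoming_ts:
        return False  # stale write: a newer value already landed, drop this one
    db[ts_key] = str(incoming_ts)
    db[key] = encoded[TS_DIGITS:]
    return True

# Simulate the PA/PB/PC story: packets arrive out of order as PB, PC, PA.
db = {}
guarded_put(db, "session", encode_value("B", ts=2))  # -> True
guarded_put(db, "session", encode_value("C", ts=3))  # -> True
guarded_put(db, "session", encode_value("A", ts=1))  # -> False, silently dropped
print(db["session"])                                 # -> C, not A
```

The point of the fixed-width prefix is that the server can slice the timestamp off the value without any parsing negotiation between client and extension.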
The trick: if the saved timestamp for that key is *newer* than the timestamp on the data that just came in, the Lua extension returns an error and does *not* save the data (because it's old). In practice the client silently drops the error, because if newer data has already been sent there's really nothing it needs to do. If the timestamp on the incoming data is *newer* than the saved timestamp, the Lua extension updates both the timestamp key and the actual key we're trying to safely store. And that's what happens 99.999999999999% of the time.

It's worth mentioning that the Lua extension is *very* fast, since it runs right on top of the local Tokyo Cabinet database. So saving 2 key/value pairs instead of 1 does not in fact halve your performance, since the bottleneck is between PyTyrant and Tokyo Tyrant.

Lessons learned:

1. Saturating a network connection can cause very, very strange things to happen.
2. All of TCP's fancy congestion control and ordering algorithms are only beneficial if you pipe everything through one connection.

Hope that makes sense!

[1] http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Congestion_control
[2] http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Ordered_data_transfer.2C_retransmission_of_lost_packets_and_discarding_duplicate_packets

_______________________________________________
Portland mailing list
[email protected]
http://mail.python.org/mailman/listinfo/portland
