Hey! Enjoying 1.4.8? Thought I'd share some rough things that you guys may enjoy:
https://github.com/dormando/mc-crusher
^ I've thrown my hat into the ring of benchmark utilities. This is probably on par with some of the work Dustin's been doing, but I went in a slightly different direction with features.

Now, 1.4.9-beta1:
http://memcached.googlecode.com/files/memcached-1.4.9_beta1.tar.gz

which is the result of the 14perf tree, now up:
https://github.com/memcached/memcached/commits/14perf

This beta will be up for at least two weeks before going final. The changes need more tuning, and some normal bugfixes/feature fixes need to go in as well. I'm giving it to you folks early so it has a good long soak.

Major changes:

- The "big cache lock" is held for much shorter stretches. This was partly influenced by the patches Ripduman Sohan sent, as well as my trying 3 different approaches and falling back to this one.

- A "per-item" hash table of mutex locks widens the number of locks available. There are many places where we don't want two threads to make progress on the same item in parallel, but far fewer where it's paramount that the hash table and LRU be accessed by a single thread. (There's a rough sketch of the idea right after this list.)

- cache_lock uses a pseudo spinlock. In my bench testing, preventing threads from going to sleep when hitting the short cache_lock helped thread scalability quite a bit. (Second sketch below.)

- item_alloc no longer does a depth search for items to expire or evict. I gave it a lot of thought and am dubious it ever helped: if you don't have any expired items at the tail, it always iterates over 50 items, which is slow. Removing it was one of the larger performance improvements in this set. (Third sketch below.)

- Hash calculations are now mostly done outside of the big lock. This was already changed in 1.6.
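To make the per-item lock idea a bit more concrete, here is a rough sketch. This is not the actual 14perf code: the names and the lock count are made up, the hash is a stand-in, and I've used a power-of-two size with a mask purely to keep it short. The point is that the key is hashed before any lock is taken, and the hash then picks one mutex out of a small pool, so two threads only contend when their keys land on the same stripe:

    #include <pthread.h>
    #include <stdint.h>
    #include <stddef.h>

    /* A small, fixed pool of mutexes "striped" across the item hash space.
     * The count is arbitrary; a power of two lets us use a mask. */
    #define ITEM_LOCK_COUNT 1024
    static pthread_mutex_t item_locks[ITEM_LOCK_COUNT];

    void item_locks_init(void) {
        for (int i = 0; i < ITEM_LOCK_COUNT; i++)
            pthread_mutex_init(&item_locks[i], NULL);
    }

    /* Stand-in hash (FNV-1a); memcached itself uses Bob Jenkins' hash. */
    static uint32_t hash_key(const char *key, size_t nkey) {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < nkey; i++) {
            h ^= (unsigned char)key[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Hypothetical hash table lookup. */
    extern void *assoc_find(const char *key, size_t nkey, uint32_t hv);

    void *fetch_item(const char *key, size_t nkey) {
        /* 1. Hash with no lock held at all. */
        uint32_t hv = hash_key(key, nkey);

        /* 2. Pick one lock out of the pool; only keys that collide on
         *    the low bits contend with each other. */
        pthread_mutex_t *lock = &item_locks[hv & (ITEM_LOCK_COUNT - 1)];

        /* 3. Keep the critical section down to the lookup itself. */
        pthread_mutex_lock(lock);
        void *it = assoc_find(key, nkey, hv);
        pthread_mutex_unlock(lock);
        return it;
    }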
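The "pseudo spinlock" amounts to something like this (again just a sketch, and the spin count is arbitrary): spin on trylock for a bit instead of letting the kernel put the thread to sleep, and only fall back to a blocking lock if the mutex stays busy. For a lock that is only held for a handful of instructions, sleeping and being rescheduled costs far more than burning a few cycles spinning.

    #include <pthread.h>

    #define SPIN_TRIES 1000  /* arbitrary; tune for your workload */

    static void pseudo_spin_lock(pthread_mutex_t *lock) {
        for (int i = 0; i < SPIN_TRIES; i++) {
            if (pthread_mutex_trylock(lock) == 0)
                return;              /* got the lock without sleeping */
        }
        pthread_mutex_lock(lock);    /* give up and block normally */
    }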
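And the item_alloc change, very roughly (a simplified sketch over a made-up item struct; the real allocator handles many more cases): the old path walked up to 50 items from the LRU tail under the cache lock hoping to find something expired, while the new path only ever looks at the tail itself.

    #include <stddef.h>
    #include <time.h>

    /* Hypothetical, heavily simplified LRU item. */
    typedef struct _item {
        struct _item *prev;   /* next item towards the LRU head */
        time_t exptime;       /* absolute expiry time; 0 means never */
    } item;

    /* Old behaviour (roughly): walk up to 50 items from the tail looking
     * for one that has already expired. If nothing near the tail is
     * expired, this touches 50 items on every allocation. */
    item *search_for_expired(item *tail, time_t now) {
        item *search = tail;
        for (int tries = 0; tries < 50 && search != NULL;
             tries++, search = search->prev) {
            if (search->exptime != 0 && search->exptime <= now)
                return search;
        }
        return NULL;
    }

    /* New behaviour (roughly): only the tail is considered. If it has
     * expired we reclaim it for free; otherwise it simply becomes the
     * eviction victim. Exactly one item is examined either way. */
    int tail_is_expired(const item *tail, time_t now) {
        return tail != NULL && tail->exptime != 0 && tail->exptime <= now;
    }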
Reasoning:

- Most of it is covered above. I looked through Ripduman's patches and decided to go a slightly different route. I studied all of the locks carefully to audit what changes to make. In addition, I made no change which significantly increases memory usage. While we can still release a specialized engine which inflates datastructures in a tradeoff for speed, I have a strong feeling it's not even necessary.

- I only kept patches that had a measurable benefit. I threw away a lot of code!

Results:

- On my desktop, I was able to increase the number of set commands per second from 300,000 to 930,000.

- With one get per request, I saw 500k to 600k requests per second. This was largely limited by the localhost driver; it may be faster with real hardware.

- With multigets, I was able to drive up to 4.5 million keys per second (4.5 million get_hits per second). Reality will be a bit lower than this.

- Saturated 10gbps of localhost traffic with 256-512 byte objects.

- Saturated 35gbps of localhost traffic with 4k objects.

- Saturated 45gbps of localhost traffic with 8k objects.

- The patches improve thread scalability. Under high load, performance dropoffs now happen around 5 or 6 threads, whereas previously even 4 threads (the default!) could cause a slowdown.

Future work:

I have some ideas to play with; some might go into 1.4.9, some later. I don't believe any further performance enhancement is really necessary, as it's trivial to saturate 10gbps of traffic now. I need to hammer out more of the bench tool and make a formal blog post with pretty pictures; that's more interesting.

- The item hash needs tuning. It's using a modulo instead of a hashmask, and it needs a way to initialize the size of the table, etc.

- I played with using the Intel hardware crc32c instruction, but that lowered performance, as it slammed the locks together too early. This needs more work before I push the branch up, as well as verification of the hash distribution. (There's a tiny sketch of the intrinsic in the P.S. at the bottom.)

- It may be safe to split cache_lock into separate cache_lock and lru_locks, but I haven't personally verified the safety of that yet, and the throughput is already too high for my box to measure the difference.

Notes:

- NUMA is kind of a bitch. If you want to reproduce my results on a big box, you'll need to bind memcached to a single NUMA node:

    numactl --cpunodebind=0 ./memcached -m 4000 -t 4

  You can also try twiddling --interleave and seeing how the performance changes. There isn't a hell of a lot we can do here, but we can move many connection buffers to be "numa-local" and get what we can out of it. The performance, even with memcached interleaved, isn't too bad at all, and the patches do improve things (for me).

- I have not done any verification on latency yet. Given the low number of connections I've been using in testing, it's not really possible for requests to have taken longer than 0.1ms. Still, over the coming weeks I will build the necessary functionality into mc-crusher and more formally test how latency is affected by a mix of set/get commands.

have fun, and everyone who makes presentations about "memcached scales to a limit" can bite me. If you honestly need it to run faster than this, just send us a fucking e-mail.

If you like what I do and would like to see projects I work on deal better with NUMA or 10gbps ethernet, see here: http://memcached.org/feedme

-Dormando
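P.S. for the curious, the crc32c experiment is along these lines (illustrative only; it needs SSE4.2 and -msse4.2, and the resulting distribution is exactly what still needs verifying):

    #include <nmmintrin.h>   /* SSE4.2 intrinsics; build with -msse4.2 */
    #include <stdint.h>
    #include <stddef.h>

    /* Hash a key with the hardware crc32c instruction, one byte at a
     * time. The 8-bytes-at-a-time _mm_crc32_u64 variant is faster but
     * needs tail/alignment handling. */
    static uint32_t crc32c_hash(const char *key, size_t nkey) {
        uint32_t crc = ~0u;
        for (size_t i = 0; i < nkey; i++)
            crc = _mm_crc32_u8(crc, (unsigned char)key[i]);
        return ~crc;
    }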
