Thanks Dormando!  The performance benchmark you did is really impressive!

On Thursday, October 6, 2011 9:54:44 AM UTC+8, Dormando wrote:
>
> Hey,
>
> Enjoying 1.4.8? Thought I'd share some rough things that you guys may
> enjoy:
>
> https://github.com/dormando/mc-crusher
> ^ I've thrown my hat into the ring of benchmark utilities. This is
> probably on par with some work Dustin's been doing, but I went in a
> slightly different direction with features.
>
> Now, 1.4.9-beta1:
>
> http://memcached.googlecode.com/files/memcached-1.4.9_beta1.tar.gz
>
> Which is the result of the 14perf tree, now up:
>
> https://github.com/memcached/memcached/commits/14perf
>
> This beta will be up for at least two weeks before going final. The
> changes need more tuning, and some normal bugfixes/feature fixes need to
> go in as well. I'm giving it to you folks early so it has a good long
> soak.
>
> Major changes:
>
> - The "Big cache lock" is much shorter. Partly influenced by the patches
> Ripduman Sohan sent, as well as me trying 3 different approaches and
> failing back to this one.
>
> - a "per-item" hash table of mutex locks is used to widen the amount of
> locks available. There are many instances where we don't want two threads
> to progress on the same item in parallel, but many fewer places where it's
> paramount for the hash table and LRU to be accessed by a single thread.
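>
> A rough sketch of the shape of it (names and sizes here are illustrative,
> not the exact memcached internals): each item maps to one mutex out of a
> fixed pool via its key hash, so two threads only contend when they touch
> items that happen to hash to the same lock.
>
> #include <pthread.h>
> #include <stdint.h>
>
> #define ITEM_LOCK_COUNT 1024  /* assumed pool size, tunable */
>
> static pthread_mutex_t item_locks[ITEM_LOCK_COUNT];
>
> void item_locks_init(void) {
>     for (int i = 0; i < ITEM_LOCK_COUNT; i++)
>         pthread_mutex_init(&item_locks[i], NULL);
> }
>
> /* hv is the precomputed hash of the item's key */
> void item_lock(uint32_t hv)   { pthread_mutex_lock(&item_locks[hv % ITEM_LOCK_COUNT]); }
> void item_unlock(uint32_t hv) { pthread_mutex_unlock(&item_locks[hv % ITEM_LOCK_COUNT]); }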
>
> - cache_lock uses a pseudo spinlock. In my bench testing, preventing
> threads from going to sleep when hitting the short cache_lock helped with
> thread scalability quite a bit.
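>
> The idea, roughly (a sketch, not the exact code): spin on trylock instead
> of calling the blocking lock, since the critical section is now short
> enough that putting a thread to sleep and waking it costs more than
> spinning.
>
> #include <pthread.h>
>
> static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
>
> static inline void cache_lock_acquire(void) {
>     /* busy-wait; the holder exits quickly, so we skip the scheduler */
>     while (pthread_mutex_trylock(&cache_lock) != 0)
>         ;  /* spin */
> }
>
> static inline void cache_lock_release(void) {
>     pthread_mutex_unlock(&cache_lock);
> }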
>
> - item_alloc no longer does a depth search for items to expire or evict.
> I gave it a lot of thought and am dubious it ever helped. If you don't
> have any expired items at the tail, it will always iterate 50 items, which
> is slow. This was one of the larger performance improvements from the
> changes I made.
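>
> To illustrate the difference (types and helpers below are stand-ins for
> the sketch, not the real functions):
>
> #include <stddef.h>
> #include <time.h>
>
> typedef struct item { struct item *prev; time_t exptime; } item;
> item *lru_tail(int slab_class);   /* hypothetical helpers */
> item *reuse_item(item *it);
> item *evict_item(item *it);
>
> /* Old: walk up to 50 items from the LRU tail hoping one has expired. */
> item *item_alloc_old(int slab_class, time_t now) {
>     item *it = lru_tail(slab_class);
>     for (int tries = 0; tries < 50 && it != NULL; tries++, it = it->prev) {
>         if (it->exptime != 0 && it->exptime < now)
>             return reuse_item(it);        /* found an already-expired item */
>     }
>     it = lru_tail(slab_class);
>     return it != NULL ? evict_item(it) : NULL;
> }
>
> /* New: look only at the tail; reuse it if expired, otherwise evict it. */
> item *item_alloc_new(int slab_class, time_t now) {
>     item *it = lru_tail(slab_class);
>     if (it == NULL)
>         return NULL;                      /* empty LRU: nothing to evict */
>     if (it->exptime != 0 && it->exptime < now)
>         return reuse_item(it);            /* tail already expired: reuse it */
>     return evict_item(it);                /* otherwise evict the tail item */
> }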
>
> - Hash calculations are now mostly done outside of the big lock. This was
> a change in 1.6 already.
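>
> In other words (function names are stand-ins for illustration):
>
> #include <stddef.h>
> #include <stdint.h>
>
> typedef struct item item;                 /* opaque here */
> uint32_t hash(const void *key, size_t nkey);
> item *assoc_find(const char *key, size_t nkey, uint32_t hv);
> void item_lock(uint32_t hv);              /* per-item lock, as sketched above */
> void item_unlock(uint32_t hv);
>
> item *get_sketch(const char *key, size_t nkey) {
>     uint32_t hv = hash(key, nkey);        /* computed with no lock held */
>     item_lock(hv);
>     item *it = assoc_find(key, nkey, hv); /* short critical section */
>     item_unlock(hv);
>     return it;
> }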
>
> Reasoning:
>
> - Most of this was reasoned above. I looked through Ripduman's patches and
> decided to go a slightly different route. I studied all of the locks
> carefully to audit what changes to make. In addition, I made no change
> which significantly increases the memory usage. While we can still release
> a specialized engine which inflates data structures as a tradeoff for
> speed, I have a strong feeling it's not even necessary.
>
> - I only kept patches that had a measurable benefit. I threw away a lot of
> code!
>
> Results:
>
> - On my desktop, I was able to increase the number of set commands per
> second from 300,000 to 930,000.
>
> - With one get per request, I saw 500k to 600k per second. This was
> largely limited by the localhost driver; it may be faster with real
> hardware.
>
> - With multigets, I was able to drive up to 4.5 million keys per second.
> (4.5 million get_hits per second). Reality will be a bit lower than this.
>
> - Saturated 10gbps of localhost traffic with 256-512 byte objects.
>
> - Saturated 35gbps of localhost traffic with 4k objects.
>
> - Saturated 45gbps of localhost traffic with 8k objects.
>
> - The patches increase thread scalability. Under high load, performance
> drop-offs now happen around 5 or 6 threads, whereas previously as few as 4
> (the default!) could cause a slowdown.
>
> Future work:
>
> I have some ideas to play with; some might go into 1.4.9, some later. I
> don't believe any further performance enhancement is really necessary, as
> it's trivial to saturate 10gbps of traffic now.
>
> Need to hammer out more of the bench tool and make a formal blog post with
> pretty pictures. That's more interesting.
>
> - Item hash needs tuning. It's using a modulo instead of a hashmask. Needs
> a way to initialize the size of the table, etc.
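>
> The tuning in question, roughly (sizes are made up for illustration):
>
> #include <stdint.h>
>
> #define N_ITEM_LOCKS 1000                       /* arbitrary size: needs a modulo */
> #define ITEM_LOCK_HASHPOWER 13                  /* power of two: 8192 locks */
> #define ITEM_LOCK_MASK ((1u << ITEM_LOCK_HASHPOWER) - 1)
>
> uint32_t bucket_modulo(uint32_t hv) { return hv % N_ITEM_LOCKS; }   /* current */
> uint32_t bucket_mask(uint32_t hv)   { return hv & ITEM_LOCK_MASK; } /* cheaper; table size set at startup */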
>
> - I played with using the Intel hardware crc32c instruction, but that
> lowered performance as it slammed the locks together too early. This needs
> more work before I push the branch up, as well as verification of the
> hash distribution.
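>
> For reference, a minimal crc32c key hash using the SSE4.2 intrinsics
> (needs -msse4.2; the loop structure is illustrative, and as noted the
> distribution still needs checking):
>
> #include <nmmintrin.h>
> #include <stdint.h>
> #include <stddef.h>
> #include <string.h>
>
> uint32_t crc32c_hash(const void *key, size_t nkey) {
>     const uint8_t *p = key;
>     uint32_t crc = ~0u;
>     uint64_t chunk;
>     while (nkey >= 8) {                /* 8 bytes per crc32 instruction */
>         memcpy(&chunk, p, 8);
>         crc = (uint32_t)_mm_crc32_u64(crc, chunk);
>         p += 8;
>         nkey -= 8;
>     }
>     while (nkey--)                     /* remaining tail bytes */
>         crc = _mm_crc32_u8(crc, *p++);
>     return ~crc;
> }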
>
> - It may be safe to split the cache_lock into cache_lock and lru_locks,
> but I haven't verified the safety of this personally yet, and the
> performance is already high enough that my box can't measure whether the
> change helps.
>
> Notes:
>
> - NUMA is kind of a bitch. If you want to reproduce my results on a big
> box, you'll need to bind memcached to a single NUMA node:
>
> numactl --cpunodebind=0 ./memcached -m 4000 -t 4
>
> You can also try twiddling --interleave and seeing how the performance
> changes. There isn't a hell of a lot we can do here, but we can move many
> connection buffers to be "numa-local" and get what we can out of it.
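>
> For example (an assumed variant of the command above):
>
> numactl --interleave=all ./memcached -m 4000 -t 4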
>
> The performance, even with memcached interleaved, isn't too bad at all,
> and the patches do improve things (for me).
>
> - I have not done any verification on latency yet. Given the low number of
> connections I've been using in testing, it's not really possible for
> requests to have taken longer than 0.1ms. Still, over the weeks I will
> build the necessary functionality into mc-crusher and more formally test
> how latency is affected by a mix of set/get commands.
>
> have fun, and everyone who makes presentations about "memcached scales to
> a limit" can bite me. If you honestly need it to run faster than this,
> just send us a fucking e-mail.
>
> If you like what I do and would like to see projects I work on deal better
> with NUMA or 10gbps ethernet, see here: http://memcached.org/feedme
> -Dormando
>
>
