On 03/02/15 05:51 PM, Mike Hommey wrote:
> Hi,
>
> I've been tracking a startup time regression in Firefox for Android when
> we tried to switch from mozjemalloc (memory refresher: it's derived from
> jemalloc 0.9) to mostly current jemalloc dev.
>
> It turned out to be https://github.com/jemalloc/jemalloc/pull/192 but in
> the process I found a few interesting things that I thought are worth
> mentioning:
>
> - Several changesets between 3.6 and current dev made the number of
>   instructions as reported by perf stat on GNU/Linux x86-64 increase
>   significantly, on a ~200k alloc/dealloc testcase that does nothing
>   else[1]:
>   - 5460aa6f6676c7f253bfcb75c028dfd38cae8aaf made the count go from
>     69M to 76M.
>   - 6ef80d68f092caf3b3802a73b8d716057b41864c from 76M to 81.5M
>   - 4dcf04bfc03b9e9eb50015a8fc8735de28c23090 from 81.5M to 85M
>   - 155bfa7da18cab0d21d87aa2dce4554166836f5d from 85M to 88M
>   I didn't investigate further because it was a red herring as far as
>   the regression I was tracking was concerned.
>
> - The average number of mutex locks per alloc/dealloc is close to 1 with
>   mozjemalloc (1.001), but 1.13 with jemalloc 3 (same testcase as above).
>   Fortunately, contention is likely lower (I measured it to be lower, but
>   the instrumentation had so much overhead that it may have skewed the
>   results), but pthread_mutex_lock/unlock are not free as far as
>   instruction count is concerned.
>
> Cheers,
>
> Mike
You can speed up locking/unlocking by ~10-20% by dropping in a lighter mutex
implementation. Here's a simple C11 implementation based on Drepper's futex
paper, for example:

https://github.com/thestinger/allocator/blob/master/mutex.h
https://github.com/thestinger/allocator/blob/master/mutex.c

It would be easy enough to add (adaptive) spinning to lock/unlock, just like
the glibc adaptive mutex that jemalloc currently uses.

Implementing great load balancing for arenas would greatly reduce the
benefits of fine-grained locking. The best approach that I've come up with
is the following:

* 1 arena per core, rather than 4 arenas per core
* assign the initial threads via round-robin, until each arena is in use
* once there are no unused arenas left, switch to sched_getcpu()
* store the thread ID of the last thread to allocate from each arena

The algorithm for picking an arena when allocating:

    if thread.last_arena.last_allocator == thread.id && trylock() != fail:
        pass
    else:
        pick_arena_with_sched_getcpu()
        lock()
        set_last_allocator()

This results in significantly better load balancing than jemalloc has at the
moment while using 1/4 as many arenas.
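To make the futex suggestion concrete, here is a minimal sketch of the kind
of C11 mutex described in Drepper's "Futexes Are Tricky" paper. It is not a
copy of the linked mutex.c; the names (futex_mutex_t, sys_futex, and so on)
are only illustrative. It uses the usual three states -- 0 = unlocked,
1 = locked with no waiters, 2 = locked with possible waiters -- so the
uncontended lock and unlock paths never enter the kernel. Linux-only, and it
leans on the default seq_cst ordering of the C11 atomics rather than tuned
memory orders.

#define _GNU_SOURCE             /* for syscall() */
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* 0 = unlocked, 1 = locked (no waiters), 2 = locked (possible waiters) */
typedef atomic_int futex_mutex_t;
#define FUTEX_MUTEX_INITIALIZER 0

static long sys_futex(void *uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void futex_mutex_lock(futex_mutex_t *m) {
    int c = 0;
    /* Fast path: a single CAS from 0 to 1, no syscall when uncontended. */
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;
    /* Slow path: advertise waiters by setting the state to 2, then sleep
     * until the value changes; re-acquire with 2 so unlock knows to wake. */
    if (c != 2)
        c = atomic_exchange(m, 2);
    while (c != 0) {
        sys_futex(m, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange(m, 2);
    }
}

static void futex_mutex_unlock(futex_mutex_t *m) {
    /* Fast path: if nobody was waiting (state was 1), skip the syscall. */
    if (atomic_exchange(m, 0) == 2)
        sys_futex(m, FUTEX_WAKE_PRIVATE, 1);
}

Adaptive spinning would slot into futex_mutex_lock just before the
FUTEX_WAIT call: spin on an atomic_load for a bounded number of iterations,
then fall back to sleeping in the kernel.

And here is a rough sketch of the arena-picking scheme above, assuming one
pthread_mutex-protected arena per core. All of the names (arena_t,
choose_arena, arenas, arena_count) are made up for illustration and are not
jemalloc's internals; the unlocked read of last_allocator is only a
heuristic hint, not something correctness depends on.

#define _GNU_SOURCE             /* for sched_getcpu() */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_t last_allocator;   /* last thread to allocate from this arena;
                                 * assumed zero-initialized at creation */
} arena_t;

static arena_t *arenas;         /* one arena per core, created at startup */
static unsigned arena_count;    /* == number of cores */
static atomic_uint next_arena;  /* round-robin counter for new threads */

static __thread arena_t *thread_arena;  /* this thread's last-used arena */

/* First arena for a new thread: round-robin until every arena has been
 * handed out once, then fall back to the CPU the thread is running on. */
static arena_t *initial_arena(void) {
    unsigned n = atomic_fetch_add(&next_arena, 1);
    if (n < arena_count)
        return &arenas[n];
    return &arenas[sched_getcpu() % arena_count];
}

/* Pick and lock an arena for an allocation; the caller releases
 * a->lock with pthread_mutex_unlock() when it is done. */
static arena_t *choose_arena(void) {
    arena_t *a = thread_arena;
    if (a == NULL)
        a = thread_arena = initial_arena();

    /* Fast path: if this thread was the last one to allocate here and the
     * lock is free, stick with the same arena and skip sched_getcpu(). */
    if (pthread_equal(a->last_allocator, pthread_self()) &&
        pthread_mutex_trylock(&a->lock) == 0)
        return a;

    /* Slow path: rebalance onto the arena for the current CPU. */
    a = &arenas[sched_getcpu() % arena_count];
    pthread_mutex_lock(&a->lock);
    a->last_allocator = pthread_self();
    thread_arena = a;
    return a;
}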