Hiya,

I've been working for some time on improving the libc malloc in
DragonFly, specifically improving its thread scaling and reducing the
number of mmap/munmap system calls it issues.

To address the issue of scaling, I added per-thread magazines for
allocations < 8K (ones that would hit the current slab zones); the
magazine logic is based straight on Bonwick and Adams's 2001
'Magazines and Vmem' paper, about the Solaris kernel allocator and
libumem. To address the number of mmaps the allocator was making, I
added a single magazine between the slab layer and mmap - it caches up
to 64 zones and reuses them rather than requesting them from and
releasing them to the system. In addition, I made the first request
for a zone allocate not one but 8 zones, the second allocates 6, and
so on, until the allocator stabilizes at requesting one zone at a
time. This logic is meant to deal with programs that issue requests
for different-sized objects early in their lives.
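
To make that concrete, here is a rough sketch in C of the two caching
layers described above. The names, sizes, and locking are simplified
assumptions for illustration only; they do not match the actual
nmalloc.c code:

/*
 * Illustrative sketch only -- names, constants, and locking are
 * simplified assumptions and do not match the real nmalloc.c.
 */
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

#define MAG_ROUNDS   64          /* objects per per-thread magazine */
#define ZONE_SIZE    (64 * 1024) /* size of one slab zone (assumed) */
#define DEPOT_MAX    64          /* zones cached between slab layer and mmap */

/* Per-thread magazine: a small stack of free objects of one size class. */
struct magazine {
	int   rounds;
	void *objs[MAG_ROUNDS];
};

/* Depot of whole zones, shared by all threads. */
static struct {
	pthread_mutex_t lock;
	int   nzones;
	void *zones[DEPOT_MAX];
	int   burst;            /* how many zones to mmap on a miss */
} depot = { PTHREAD_MUTEX_INITIALIZER, 0, { 0 }, 8 };

/* Fast path: pop from the calling thread's magazine, no locking. */
static void *
mag_alloc(struct magazine *mag)
{
	if (mag->rounds > 0)
		return (mag->objs[--mag->rounds]);
	return (NULL);          /* empty: caller falls back to the slab layer */
}

/* Fast path free: push back onto the magazine if there is room. */
static int
mag_free(struct magazine *mag, void *obj)
{
	if (mag->rounds < MAG_ROUNDS) {
		mag->objs[mag->rounds++] = obj;
		return (1);
	}
	return (0);             /* full: caller returns obj to the slab layer */
}

/*
 * Zone allocation: reuse a cached zone if possible; otherwise mmap a
 * burst of zones (8 on the first miss, then 6, ... down to 1) and cache
 * the extras, so early mixed-size workloads don't issue one mmap per zone.
 */
static void *
zone_alloc(void)
{
	void *zone = NULL;

	pthread_mutex_lock(&depot.lock);
	if (depot.nzones > 0) {
		zone = depot.zones[--depot.nzones];
	} else {
		int i, n = depot.burst;

		for (i = 0; i < n; i++) {
			void *z = mmap(NULL, ZONE_SIZE, PROT_READ | PROT_WRITE,
			    MAP_ANON | MAP_PRIVATE, -1, 0);
			if (z == MAP_FAILED)
				break;
			if (zone == NULL)
				zone = z;
			else if (depot.nzones < DEPOT_MAX)
				depot.zones[depot.nzones++] = z;
			else
				munmap(z, ZONE_SIZE);
		}
		if (depot.burst > 2)
			depot.burst -= 2;       /* 8, 6, 4, 2, ... */
		else
			depot.burst = 1;        /* then one zone at a time */
	}
	pthread_mutex_unlock(&depot.lock);
	return (zone);
}

The point of the sketch is just the two fast paths: a lock-free
per-thread pop/push for small objects, and a locked zone depot that
shrinks its mmap burst from 8 down to 1.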

Some benchmark results so far:

sh6bench =============================
To quote Jason Evans, sh6bench 'is a quirky malloc benchmark that has
been used in some of the published malloc literature, so I include it
here. sh6bench does cycles of allocating groups of objects, where the
size of objects in each group is random. Sometimes the objects are held
for a while before being freed.' The test is available at:
http://m-net.arbornet.org/~sv5679/sh6bench.c
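
For a sense of what that workload looks like, here is a tiny C sketch
of the pattern described in the quote. This is not the actual
sh6bench.c (which is at the URL above); the group counts and the hold
policy below are made up for illustration:

/*
 * Rough illustration of the allocation pattern described above; this is
 * not the actual sh6bench.c, just the shape of the workload it exercises.
 */
#include <stdlib.h>

#define GROUPS     1000
#define GROUP_SIZE 100
#define MAX_SIZE   1512

int
main(void)
{
	void *now[GROUP_SIZE];          /* freed within the same cycle */
	void *held[GROUP_SIZE] = { 0 }; /* held until the next cycle   */
	int g, i;

	for (g = 0; g < GROUPS; g++) {
		/* Each group uses one random object size. */
		size_t sz = 1 + (size_t)(rand() % MAX_SIZE);

		/* Release the objects held over from the previous cycle. */
		for (i = 0; i < GROUP_SIZE; i++)
			free(held[i]);

		for (i = 0; i < GROUP_SIZE; i++) {
			now[i]  = malloc(sz);
			held[i] = malloc(sz);
		}

		/* Free one group right away, hold the other for a while. */
		for (i = 0; i < GROUP_SIZE; i++)
			free(now[i]);
	}
	for (i = 0; i < GROUP_SIZE; i++)
		free(held[i]);
	return (0);
}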

When run on DragonFly, with 50000 calls for objects between 1 and 1512
bytes, nmalloc (the current libc allocator) takes 85 s to finish the
test; nmalloc 1.33 takes 58 s, spending nearly 20 s less in system
time.

When tested with 2500 calls for 1...1512-byte objects on FreeBSD 8,
nmalloc, nmalloc 1.33, and jemalloc (the FreeBSD libc allocator) turn
in times very close to one another.
Here are the total memory uses and mmap call counts
(nmalloc 1.33 'g' is nmalloc 1.33 with the per-thread caches disabled):

                   mmaps / munmaps   total space requested / released   diff
nmalloc            1582 / 438        107,598,464 b / 29,220,560 b       78,377,904 b
nmalloc 1.33       1154 / 9           81,261,328 b /    852,752 b       80,408,576 b
nmalloc 1.33 'g'   1148 / 9           81,130,256 b /    852,752 b       80,277,504 b
jemalloc             45 / 4           82,718,411 b /  1,609,424 b       81,108,987 b

I also graphed the allocation structure using Jason Evans's mtrgraph
tool (nmalloc 1.33 can generate utrace events).
nmalloc: http://acm.jhu.edu/~me/heap/sh6bench_nmalloc.png
nmalloc 1.33: http://acm.jhu.edu/~me/heap/sh6bench_nmalloc133.png
jemalloc: http://acm.jhu.edu/~me/heap/sh6bench_jemalloc.png

The horizontal axis of these graphs is time, in terms of allocation/free
events; the vertical axis is address space, split into buckets. Similar
traces were generated when jemalloc was in development:
http://people.freebsd.org/~jasone/jemalloc/plots/

From these, you can see that nmalloc 1.33 has heap structure and
fragmentation characteristics similar to the original DragonFly
allocator; at least, this thread-scaling work has not unduly increased
fragmentation.

MySQL sysbench ==================================
I ran sysbench OLTP against a MySQL server on FreeBSD 8 (2-core
Core2Duo 3.0GHz) and varied the number of threads; this is how nmalloc
1.33 fared against nmalloc and jemalloc (in transactions/sec):
http://m-net.arbornet.org/~sv5679/sysbench_nmalloc.gif

Polachok repeated these tests on a 4-core Xeon server (4 core + 4 HTT,
2.6GHz) running DragonFly; here is the improvement, again in
transactions/sec:
http://m-net.arbornet.org/~sv5679/sysbench_nmalloc_df.gif

===================================

I'd appreciate it if people could try these allocator improvements out;
to do so, you can either:
1) Grab http://m-net.arbornet.org/~sv5679/nmalloc.c and
http://m-net.arbornet.org/~sv5679/Makefile; run make, and you will get
a shared object that you can LD_PRELOAD (see the example command after
this list).
2) Grab http://m-net.arbornet.org/~sv5679/nmalloc.c and replace
/usr/src/lib/libc/stdlib/nmalloc.c; rebuild world. Your entire system
will use the new allocator.
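
For option 1, usage would look something like this (the name nmalloc.so
is an assumption; use whatever shared object the Makefile actually
produces):

    LD_PRELOAD=./nmalloc.so ./your_program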

Kick it around, see if it is faster; see if it explodes in use.

All of this work is also at
http://gitweb.dragonflybsd.org/~vsrinivas/dragonfly.git, in the
nmalloc_mt branch.

I'd also appreciate any code reviews, if anyone is interested.

Thanks!
-- vs
