David Boreham wrote:
Howard Chu wrote:
Kinda interesting - with hoard this shows us processing 23000 entries
per second in the single-threaded case, vs only 3521 per second with
glibc malloc.
It is possible that you are seeing the syndrome that I wrote about here:
http://www.bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html
AFAIK the poorly behaving code is still present in today's glibc malloc.
(the problem affects UMich-derived LDAP servers because entries in
the cache comprise a significant proportion of the heap traffic, and they
tend to get allocated by a different thread than frees them when the
cache fills up). malloc exhibits lock contention when blocks are freed
by a different thread than allocated them (more exactly when the threads
hash to different arenas).
Yes, that's a factor I had considered. One of the approaches I had
tested was tagging each entry with the threadID that allocated it, and
deferring the frees until the owning thread comes back to dispose of
them. I wasn't really happy with this approach because it meant the
entry cache size would not be strictly regulated, though on a busy
server it's likely that all of the frees would happen in reasonably
short time.
Still, free() contention alone doesn't explain the huge performance gap
in the single-threaded case. However, judging from the fact that the
very first glibc search often runs faster than the second and subsequent
ones, I'd say that there are other significant costs in glibc free().
Most likely would be that glibc returns memory back to the OS too
frequently. Hoard and Umem also return memory back to the OS, but they
also keep more extensive free lists. Tcmalloc never returns memory back
to the OS.
What's also interesting is that for Hoard, Umem, and Tcmalloc, the
multi-threaded query times are consistently about 2x slower than the
single-threaded case. The 2x slowdown makes sense since it's only a
dual-core CPU and it's doing 4x as much work. This kinda says that the
cost of malloc is overshadowed by the overhead of thread scheduling.
But for glibc the multi-threaded times are actually faster than the
single-threaded case. Since glibc's costs are being partially hidden it
seems that the multiple threads are actually getting some benefit from
the entry cache here. Still the difference between glibc and the other
allocators is so dramatic that this little difference is just academic.
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/