On Sat, 16 Mar 2019, Joshua M. Clulow wrote:

> On Sat, 16 Mar 2019 at 12:09, Bob Friesenhahn
> <[email protected]> wrote:
>> Using the default allocator:
>> % gm benchmark -duration 10 convert -size 4000x3000 tile:model.pnm -wave 25x150 null:
>> Results: 40 threads 13 iter 74.63s user 10.122100s total 1.284 iter/s 0.174 iter/cpu
>>
>> Using libumem:
>> % LD_PRELOAD_64=libumem.so.1 gm benchmark -duration 10 convert -size 4000x3000 tile:model.pnm -wave 25x150 null:
>> Results: 40 threads 13 iter 77.28s user 10.226807s total 1.271 iter/s 0.168 iter/cpu
>>
>> Using mtmalloc:
>> % LD_PRELOAD_64=libmtmalloc.so.1 gm benchmark -duration 10 convert -size 4000x3000 tile:model.pnm -wave 25x150 null:
>> Results: 40 threads 64 iter 246.82s user 10.148286s total 6.306 iter/s 0.259 iter/cpu
>
> Why was the last test 64 iterations instead of 13 like the others?

I should have explained what the "Results" string indicates.

This form of the benchmark runs a specified sub-command in a loop for a given duration (10 seconds in this case) and then reports how many iterations were completed. The difference between libc/libumem and mtmalloc is 13 iterations versus 64 iterations; more iterations are better.
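
To spell out the other fields: "iter/s" is iterations divided by the total wall-clock time (13 / 10.12s ≈ 1.284 for the default allocator versus 64 / 10.15s ≈ 6.31 for mtmalloc), and "iter/cpu" is iterations divided by the accumulated user CPU time (13 / 74.63s ≈ 0.174 versus 64 / 246.82s ≈ 0.259).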

It can be seen that the more performant case used a lot more user time, indicating it was doing more useful work rather than waiting on locks.

Each iteration does what the command-line utility itself does: it allocates the buffers it needs and then deallocates them at the end of the iteration, rather than re-using buffers across iterations.

>> Is this huge difference in performance due to mtmalloc expected?  I
>> thought that modern libumem was supposed to make up most of the
>> difference.
>
> Do you know if the umem per-thread caching stuff is working here?  It
> was originally added in:
>
>    https://www.illumos.org/issues/4489
>
> According to umem_alloc(3MALLOC) you can tune the per-thread cache
> size with the UMEM_OPTIONS environment variable, and you can measure
> various statistics by taking a core at an appropriate moment and using
> "::umastat" from mdb.

I am using OmniOS (omnios-r151026), so it should be new enough for this feature to be available. I will investigate.
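
Presumably that means running something along these lines and then taking a core to inspect with ::umastat as you describe (the perthread_cache option name is my reading of the man page, so treat it as an assumption, and the size is an arbitrary guess):

% UMEM_OPTIONS=perthread_cache=16m LD_PRELOAD_64=libumem.so.1 gm benchmark -duration 10 convert -size 4000x3000 tile:model.pnm -wave 25x150 null: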

While I have tried to minimize the total number of allocations the software requires (unlike typical Linux application software), each worker thread does perform its own allocations. This means that when an OpenMP loop is engaged, the worker threads all immediately try to allocate a buffer of similar size at almost exactly the same time. The same buffer is then re-used thousands of times. For the particular algorithm being tested, there are 40 64-byte allocations performed, each by a different thread.
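
To make that pattern concrete, here is a minimal sketch of what I am describing (illustrative only, not the actual GraphicsMagick code; the function and variable names are made up):

#include <stdlib.h>
#include <string.h>

/*
 * Sketch of the allocation pattern: every OpenMP worker thread allocates
 * its own small working buffer as soon as the parallel region starts (so
 * all threads hit the allocator at nearly the same moment), re-uses that
 * one buffer for every row it processes, and frees it when the region ends.
 */
static void
process_rows(size_t rows, size_t row_bytes)
{
#if defined(_OPENMP)
#  pragma omp parallel
#endif
  {
    unsigned char *scratch = malloc(row_bytes);  /* one allocation per thread */
    long row;

#if defined(_OPENMP)
#  pragma omp for
#endif
    for (row = 0; row < (long) rows; row++)
      {
        if (scratch != NULL)
          memset(scratch, 0, row_bytes);  /* stand-in for the real per-row work */
      }

    free(scratch);  /* free(NULL) is harmless if the allocation failed */
  }
}

int
main(void)
{
  process_rows(3000, 64);  /* e.g. the 40 threads x 64 bytes case mentioned above */
  return 0;
}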

The software uses posix_memalign() for many of these working buffers if it is available, or uses a fallback based on malloc() if it is not. I have already verified that performance without using posix_memalign() is similar.
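
For what it is worth, the selection logic is roughly like the following sketch (illustrative only; HAVE_POSIX_MEMALIGN stands in for whatever the build system actually detects, and the real fallback differs in its details):

/* Compile with -DHAVE_POSIX_MEMALIGN to take the aligned path. */
#define _POSIX_C_SOURCE 200112L  /* for the posix_memalign() declaration */
#include <stdlib.h>

/*
 * Use posix_memalign() when the platform provides it, otherwise fall back
 * to plain malloc().  For posix_memalign() the alignment must be a power
 * of two and a multiple of sizeof(void *).
 */
static void *
allocate_working_buffer(size_t alignment, size_t size)
{
#if defined(HAVE_POSIX_MEMALIGN)
  void *ptr = NULL;

  if (posix_memalign(&ptr, alignment, size) != 0)
    return NULL;
  return ptr;
#else
  (void) alignment;  /* the simple fallback ignores the alignment request */
  return malloc(size);
#endif
}

int
main(void)
{
  void *buffer = allocate_working_buffer(64, 64);  /* e.g. a 64-byte working buffer */

  free(buffer);  /* memory from either path is released with free() */
  return 0;
}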

The software was profiled and carefully optimized on a Solaris 10 system (a Sun Ultra-40 M2 with 4 AMD Opteron cores) in the 2008/2009 time-frame and still runs well on that system. Solaris (and now Illumos) has always been my primary development environment, although there is also plenty of testing under Linux, where it has always performed better than under Illumos.

Bob
--
Bob Friesenhahn
[email protected], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Public Key,     http://www.simplesystems.org/users/bfriesen/public-key.txt

