On Sat, 16 Mar 2019, Joshua M. Clulow wrote:

> On Sat, 16 Mar 2019 at 12:09, Bob Friesenhahn
> <[email protected]> wrote:
>> Using the default allocator:
>> % gm benchmark -duration 10 convert -size 4000x3000 tile:model.pnm -wave 25x150 null:
>> Results: 40 threads 13 iter 74.63s user 10.122100s total 1.284 iter/s 0.174 iter/cpu
>>
>> Using libumem:
>> % LD_PRELOAD_64=libumem.so.1 gm benchmark -duration 10 convert -size 4000x3000 tile:model.pnm -wave 25x150 null:
>> Results: 40 threads 13 iter 77.28s user 10.226807s total 1.271 iter/s 0.168 iter/cpu
>>
>> Using mtmalloc:
>> % LD_PRELOAD_64=libmtmalloc.so.1 gm benchmark -duration 10 convert -size 4000x3000 tile:model.pnm -wave 25x150 null:
>> Results: 40 threads 64 iter 246.82s user 10.148286s total 6.306 iter/s 0.259 iter/cpu
>
> Why was the last test 64 iterations instead of 13 like the others?

I should have explained what the "Results" string indicates.

This form of the benchmark runs a specified sub-command in a loop for a given duration (10 seconds in this case) and then reports how many iterations were completed. The difference between libc/libumem and mtmalloc is 13 iterations versus 64 iterations; more iterations are better.
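
To spell out the other fields: "iter/s" is iterations divided by the total wall-clock time (13 / 10.12s ≈ 1.284 for the default allocator versus 64 / 10.15s ≈ 6.31 for mtmalloc), and "iter/cpu" is iterations divided by the accumulated user CPU time (13 / 74.63s ≈ 0.174 versus 64 / 246.82s ≈ 0.259).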

It can be seen that the more performant case used a lot more user time, indicating it was doing more useful work rather than waiting on locks.

Each iteration does what the command-line utility itself does: it allocates the buffers it needs and then deallocates them at the end of the iteration, rather than re-using buffers across iterations.

>> Is this huge difference in performance due to mtmalloc expected?  I
>> thought that modern libumem was supposed to make up most of the
>> difference.
>
> Do you know if the umem per-thread caching stuff is working here?  It
> was originally added in:
>
>    https://www.illumos.org/issues/4489
>
> According to umem_alloc(3MALLOC) you can tune the per-thread cache
> size with the UMEM_OPTIONS environment variable, and you can measure
> various statistics by taking a core at an appropriate moment and using
> "::umastat" from mdb.

I am using OmniOS (omnios-r151026), so it should be new enough for this feature to be available. I will investigate.
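
Presumably that means running something along these lines and then taking a core to inspect with ::umastat as you describe (the perthread_cache option name is my reading of the man page, so treat it as an assumption, and the size is an arbitrary guess):

% UMEM_OPTIONS=perthread_cache=16m LD_PRELOAD_64=libumem.so.1 gm benchmark -duration 10 convert -size 4000x3000 tile:model.pnm -wave 25x150 null: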

While I have tried to minimize the total number of allocations the software requires (unlike typical Linux application software), each worker thread does perform its own allocations. This means that when an OpenMP loop is engaged, the worker threads all immediately try to allocate a buffer of similar size at almost exactly the same time. The same buffer is then re-used thousands of times. For the particular algorithm being tested, there are 40 64-byte allocations performed, each by a different thread.
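
To make that pattern concrete, here is a minimal sketch of what I am describing (illustrative only, not the actual GraphicsMagick code; the function and variable names are made up):

#include <stdlib.h>
#include <string.h>

/*
 * Sketch of the allocation pattern: every OpenMP worker thread allocates
 * its own small working buffer as soon as the parallel region starts (so
 * all threads hit the allocator at nearly the same moment), re-uses that
 * one buffer for every row it processes, and frees it when the region ends.
 */
static void
process_rows(size_t rows, size_t row_bytes)
{
#if defined(_OPENMP)
#  pragma omp parallel
#endif
  {
    unsigned char *scratch = malloc(row_bytes);  /* one allocation per thread */
    long row;

#if defined(_OPENMP)
#  pragma omp for
#endif
    for (row = 0; row < (long) rows; row++)
      {
        if (scratch != NULL)
          memset(scratch, 0, row_bytes);  /* stand-in for the real per-row work */
      }

    free(scratch);  /* free(NULL) is harmless if the allocation failed */
  }
}

int
main(void)
{
  process_rows(3000, 64);  /* e.g. the 40 threads x 64 bytes case mentioned above */
  return 0;
}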

The software uses posix_memalign() for many of these working buffers if it is available, or uses a fallback based on malloc() if it is not. I have already verified that performance without using posix_memalign() is similar.
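
For what it is worth, the selection logic is roughly like the following sketch (illustrative only; HAVE_POSIX_MEMALIGN stands in for whatever the build system actually detects, and the real fallback differs in its details):

/* Compile with -DHAVE_POSIX_MEMALIGN to take the aligned path. */
#define _POSIX_C_SOURCE 200112L  /* for the posix_memalign() declaration */
#include <stdlib.h>

/*
 * Use posix_memalign() when the platform provides it, otherwise fall back
 * to plain malloc().  For posix_memalign() the alignment must be a power
 * of two and a multiple of sizeof(void *).
 */
static void *
allocate_working_buffer(size_t alignment, size_t size)
{
#if defined(HAVE_POSIX_MEMALIGN)
  void *ptr = NULL;

  if (posix_memalign(&ptr, alignment, size) != 0)
    return NULL;
  return ptr;
#else
  (void) alignment;  /* the simple fallback ignores the alignment request */
  return malloc(size);
#endif
}

int
main(void)
{
  void *buffer = allocate_working_buffer(64, 64);  /* e.g. a 64-byte working buffer */

  free(buffer);  /* memory from either path is released with free() */
  return 0;
}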

The software was profiled and carefully optimized on a Solaris 10 system (a Sun Ultra-40 M2 with 4 AMD Opteron cores) in the 2008/2009 time-frame and still runs well on that system. Solaris (and now Illumos) has always been my primary development environment, although there is also plenty of testing under Linux, where it has always performed better than under Illumos.

Bob
--
Bob Friesenhahn
[email protected], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Public Key,     http://www.simplesystems.org/users/bfriesen/public-key.txt

