On Sat, 16 Mar 2019, Joshua M. Clulow wrote:
> On Sat, 16 Mar 2019 at 12:09, Bob Friesenhahn
> <[email protected]> wrote:
>> Using the default allocator:
>> % gm benchmark -duration 10 convert -size 4000x3000 tile:model.pnm -wave 25x150 null:
>> Results: 40 threads 13 iter 74.63s user 10.122100s total 1.284 iter/s 0.174 iter/cpu
>>
>> Using libumem:
>> % LD_PRELOAD_64=libumem.so.1 gm benchmark -duration 10 convert -size 4000x3000 tile:model.pnm -wave 25x150 null:
>> Results: 40 threads 13 iter 77.28s user 10.226807s total 1.271 iter/s 0.168 iter/cpu
>>
>> Using mtmalloc:
>> % LD_PRELOAD_64=libmtmalloc.so.1 gm benchmark -duration 10 convert -size 4000x3000 tile:model.pnm -wave 25x150 null:
>> Results: 40 threads 64 iter 246.82s user 10.148286s total 6.306 iter/s 0.259 iter/cpu
> Why was the last test 64 iterations instead of 13 like the others?

I should have explained what the "Results" string indicates.  This
form of benchmark runs a specified sub-command in a loop for a given
duration (10 seconds in this case) and then reports how many
iterations were accomplished.  The difference between libc/libumem
and mtmalloc is 13 iterations vs 64 iterations.  More iterations is
better.
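For what it is worth, the reported numbers divide out as expected:
'iter/s' is iterations divided by the wall-clock time and 'iter/cpu'
is iterations divided by the accumulated user CPU time (13/10.12 =
1.284 and 13/74.63 = 0.174).  The timing loop is shaped roughly like
this (a simplified sketch, not the actual GraphicsMagick benchmark
source; run_one_iteration() just stands in for running the
sub-command, and the real output also reports the OpenMP thread
count):

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Stand-in for running the 'convert ...' sub-command once. */
static void
run_one_iteration(void)
{
}

/* Current wall-clock time in seconds. */
static double
wall_clock(void)
{
  struct timeval tv;

  (void) gettimeofday(&tv, (void *) NULL);
  return (tv.tv_sec + tv.tv_usec/1.0e6);
}

int
main(void)
{
  const double duration = 10.0;   /* -duration 10 */
  double start = wall_clock();
  double elapsed;
  double user;
  long iterations = 0;
  struct rusage usage;

  /* Run the sub-command repeatedly until the duration has elapsed. */
  do
    {
      run_one_iteration();
      iterations++;
      elapsed = wall_clock() - start;
    } while (elapsed < duration);

  /* Accumulated user CPU time across all threads of this process. */
  (void) getrusage(RUSAGE_SELF, &usage);
  user = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec/1.0e6;

  printf("Results: %ld iter %.2fs user %fs total %.3f iter/s %.3f iter/cpu\n",
         iterations, user, elapsed,
         iterations/elapsed, iterations/user);

  return 0;
}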
It can be seen that the more performant case used a lot more user
time, indicating that it was spending its time doing useful work
rather than waiting on locks.
Each iteration does what the 'utility' program does, so it allocates
the buffers it needs and then deallocates them at the end of the
iteration, rather than re-using buffers across iterations.
>> Is this huge difference in performance due to mtmalloc expected?  I
>> thought that modern libumem was supposed to make up most of the
>> difference.
> Do you know if the umem per-thread caching stuff is working here?  It
> was originally added in:
>
> https://www.illumos.org/issues/4489
>
> According to umem_alloc(3MALLOC) you can tune the per-thread cache
> size with the UMEM_OPTIONS environment variable, and you can measure
> various statistics by taking a core at an appropriate moment and using
> "::umastat" from mdb.
I am using OmniOS (omnios-r151026), so it seems like it should be new
enough for this feature to be available.  I will investigate.
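If I am reading umem_alloc(3MALLOC) correctly, a first experiment
would look something like the following (the 16m per-thread cache
size is just an arbitrary value to try):

% LD_PRELOAD_64=libumem.so.1 UMEM_OPTIONS=perthread_cache=16m \
    gm benchmark -duration 10 convert -size 4000x3000 \
    tile:model.pnm -wave 25x150 null:

followed by grabbing a core with gcore(1) while a run is in progress
and inspecting it with "::umastat" under mdb.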
While I have tried to minimize the total number of allocations
required by the software (unlike typical Linux application software),
each worker thread does perform its own allocations.  This means that
when an OpenMP loop is engaged, the worker threads immediately try to
allocate a buffer of a similar size at almost exactly the same time.
The same buffer is then re-used thousands of times.  For the
particular algorithm being tested, there are 40 64-byte allocations
being performed, each by a different thread.

The software uses posix_memalign() for many of these working buffers
if it is available, or falls back to an implementation based on
malloc() if it is not.  I have already verified that performance
without using posix_memalign() is similar.
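To make the pattern concrete, it looks roughly like the following (a
simplified sketch rather than the actual GraphicsMagick code; the
HAVE_POSIX_MEMALIGN macro, allocate_working_buffer(), and ROWS are
just illustrative stand-ins for the configure check and the real
functions):

#include <stdlib.h>

#define ROWS 100000

/* Allocate a small aligned working buffer for one worker thread. */
static void *
allocate_working_buffer(size_t alignment, size_t size)
{
#if defined(HAVE_POSIX_MEMALIGN)
  void *buffer = (void *) NULL;

  if (posix_memalign(&buffer, alignment, size) != 0)
    return (void *) NULL;
  return buffer;
#else
  /* Fallback based on malloc() when posix_memalign() is not available
     (a real fallback would also need to handle the alignment itself;
     omitted here for brevity). */
  return malloc(size);
#endif
}

int
main(void)
{
#pragma omp parallel
  {
    /* Every worker hits the allocator at almost the same moment
       (40 threads -> 40 nearly simultaneous 64-byte allocations). */
    char *buffer = allocate_working_buffer(64, 64);

#pragma omp for
    for (int row = 0; row < ROWS; row++)
      if (buffer != (char *) NULL)
        buffer[row % 64] = (char) row;  /* the same buffer is re-used for every row */

    /* Each thread releases its buffer when the parallel region ends. */
    free(buffer);
  }

  return 0;
}

The important point is that all of the allocator traffic happens in a
burst right when the parallel region starts.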
The software was profiled and carefully optimized on a Solaris 10
system (Sun Ultra-40 M2) with 4 AMD Opteron cores in the 2008/2009
time-frame, and it still runs well on that system.  Solaris (and now
Illumos) has always been my primary development environment, although
there is also plenty of testing under Linux, where it has always done
better than under Illumos.
Bob
--
Bob Friesenhahn
[email protected], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Public Key, http://www.simplesystems.org/users/bfriesen/public-key.txt
------------------------------------------
illumos: illumos-discuss
Permalink: https://illumos.topicbox.com/groups/discuss/T30dd2eceb8a069b3-M6ef732485242833275f3f2a2