Yesterday I did a full GraphicsMagick benchmark run using builds linked with libumem and libmtmalloc.

Previously I had been focusing on just one problematic algorithm, for which libmtmalloc does 4X better than libumem. I also found that the OpenMP scheduling policy makes a huge difference. Static scheduling (which worked great on Solaris 10) causes a performance problem on Illumos with 20 cores (40 threads), because each thread is assigned a fixed share of the work up front and the loop cannot finish until the slowest thread completes its share. 'Guided' scheduling, which hands out chunks on demand and is thus less sensitive to latencies, works better. A minimal sketch of the difference follows.
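For anyone who wants to reproduce the scheduling effect outside of GraphicsMagick, here is a stand-alone sketch. This is not GraphicsMagick code; the per-row work is invented, and only the one-word schedule clause is the point:

#include <stdlib.h>

#define ROWS 4096L
#define COLS 4096L

/* Stand-in for per-row image work; the real algorithms differ. */
static void
process_row(unsigned char *row, long cols)
{
  long i;
  for (i = 0; i < cols; i++)
    row[i] ^= 0xff;
}

int
main(void)
{
  unsigned char *image = malloc(ROWS * COLS);
  long row;

  if (image == NULL)
    return 1;

  /*
   * schedule(static) pre-assigns a fixed block of rows to each
   * thread, so one delayed thread stalls the whole loop.
   * schedule(guided) hands out shrinking chunks on demand and
   * tolerates stragglers better.
   */
#pragma omp parallel for schedule(guided)
  for (row = 0; row < ROWS; row++)
    process_row(image + row * COLS, COLS);

  free(image);
  return 0;
}

Building with something like 'gcc -O2 -fopenmp sched.c' and switching guided to static shows the two policies side by side.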

The full run reveals that, on average, performance is better using libumem.

I found that one algorithm achieves no speed-up at all with libumem, but achieves a speed-up of 10.57X with libmtmalloc. CPU use is low in the no-speed-up case, which suggests the threads are blocked rather than spinning, and seems to rule out CPU-level contention such as cache-line thrashing.

The sensitive "canary" algorithms all work on small allocations (e.g. 32 bytes) at a time, and a couple of locks are involved as well. A rough sketch of that pattern follows.
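As an illustration of the pattern (the names and ratios here are invented; this is not GraphicsMagick code), something like the following hammers the allocator's small-object path from many threads while occasionally taking a shared lock:

#include <stdlib.h>
#include <string.h>
#include <pthread.h>

static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long shared_count;

int
main(void)
{
  long i;

#pragma omp parallel for schedule(guided)
  for (i = 0; i < 10000000L; i++)
    {
      /* Small allocation in the size class the canaries use. */
      char *p = malloc(32);
      if (p == NULL)
        continue;
      memset(p, 0, 32);

      /* Occasional shared lock, standing in for the real locks. */
      if ((i & 63) == 0)
        {
          pthread_mutex_lock(&shared_lock);
          shared_count++;
          pthread_mutex_unlock(&shared_lock);
        }
      free(p);
    }
  return 0;
}

Linking the same binary against -lumem versus -lmtmalloc (e.g. 'gcc -O2 -fopenmp alloc.c -lmtmalloc'), or preloading the libraries via LD_PRELOAD, should show whether the allocator alone reproduces the scaling gap.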

It feels like some sort of priority-inversion or scheduling issue is going on which is sensitive to the choice of memory allocator.

It is doubtful that the GNU GCC developers spend much effort tuning their pthreads-based OpenMP runtime (libgomp) under Illumos or Solaris.

Bob
--
Bob Friesenhahn
[email protected], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Public Key,     http://www.simplesystems.org/users/bfriesen/public-key.txt
