Re: [discuss] libc/libumem lack of concurrency

Robert Mustacchi Mon, 18 Mar 2019 10:15:02 -0700

On 3/18/19 6:55 , Bob Friesenhahn wrote:
> Yesterday I did a full GraphicsMagick benchmark run using builds linked
> with libumem and libmtmalloc.
> 
> Yesterday I was focusing on just one problematic algorithm and
> libmtmalloc does 4X better than libumem for that algorithm.  I also
> found that the OpenMP scheduling algorithm used made a huge difference. 
> Using static scheduling (which worked great on Solaris 10) causes a
> performance problem on Illumos with 20 cores (40 threads) because it
> demands that all threads process their data, but 'guided' scheduling
> (which is adaptive and thus less sensitive to latencies) works better.
> 
> The full run reveals that on average, performance is better using libumem.
> 
> I found that one algorithm achieves no speed-up at all with umem, but
> achieves a speed-up of 10.57X when using mtmalloc.  There is low CPU use
> when there is no speed-up, which seems to rule out CPU-level contention
> such as cache-line thrashing.
> 
> The sensitive "canary" algorithms all access small bits of allocated
> data (e.g. 32 bytes) at a time, and there are a couple of locks involved
> as well.
> 
> It feels like there is some sort of priority inversion or scheduling
> issue going on which is sensitive to the memory allocator used.
> 
> It is doubtful that the GNU GCC developers spend much effort with tuning
> pthreads-based OpenMP under Illumos or Solaris.

Hi Bob,

Thanks for digging into this; it's useful to have other takes on it. I
see all of these as reasons that we should look at improving libumem. A
few notes:

1) libumem isn't the default allocator, but a number of things link
against it so it can easily end up being pulled in. The default libc
allocator is even worse in a multi-threaded environment.

2) Right now, when a larger alignment is requested via
memalign/posix_memalign, we end up bypassing the traditional umem
caches, which can make those allocations perform worse.

3) Based on your analysis of the different algorithms in use, would it
be possible to put together a microbenchmark that reproduces the
various allocation and free patterns that are going on? That would
help us understand the problem better and see where we can improve
things more generally.

Robert

------------------------------------------
illumos: illumos-discuss
Permalink: https://illumos.topicbox.com/groups/discuss/T30dd2eceb8a069b3-M92fb6ffc1b8aa363837c1aa1
Delivery options: https://illumos.topicbox.com/groups/discuss/subscription