On 3/18/19 12:48, Bob Friesenhahn wrote:
> On Mon, 18 Mar 2019, Bob Friesenhahn wrote:
>
>> Based on my understanding of plockstat output (reported times are
>> described as "Average duration of an event, in nanoseconds"), libumem
>> is taking 28.6, 47.8, or 67.1 *milliseconds* per contested lock for
>> memory requests related to posix_memalign(). This is an astonishing
>> amount of time.
>
> If I remove use of posix_memalign() and instead use an aligned memory
> implementation based on malloc() with an added header used by a special
> matching freeing function, then the longest umem-related time I see is
> vmem_sbrk_alloc, which takes 8,052,736 nanoseconds or sometimes
> 23,555,227 nanoseconds. Otherwise, umem is largely gone from the
> radar. Contention issues due to design issues in my own code then
> become the prominent ones.
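
The wrapper Bob describes is presumably something along these lines: over-allocate
with plain malloc(), round the result up to the requested alignment, and keep the
original pointer in a small header so a matching free routine can find it again.
A minimal sketch of that approach (the names aligned_malloc/aligned_free and the
exact layout are illustrative, not his actual code):

    #include <stdlib.h>
    #include <stdint.h>

    /*
     * Over-allocate with malloc(), round the result up to the requested
     * alignment (assumed to be a power of two), and stash the original
     * malloc() pointer in a header just below the aligned address so
     * aligned_free() can recover it.
     */
    void *
    aligned_malloc(size_t align, size_t size)
    {
        void *raw = malloc(size + align + sizeof (void *));
        if (raw == NULL)
            return (NULL);

        uintptr_t aligned = ((uintptr_t)raw + sizeof (void *) + align - 1) &
            ~(uintptr_t)(align - 1);
        ((void **)aligned)[-1] = raw;   /* header: the pointer malloc gave us */
        return ((void *)aligned);
    }

    void
    aligned_free(void *ptr)
    {
        if (ptr != NULL)
            free(((void **)ptr)[-1]);
    }

Because the underlying request goes through malloc(), small allocations stay on
umem's cached (and per-thread cached) path instead of dropping into the vmem
interfaces discussed below.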
Hi Bob,

Thanks for the additional detail here. So, what's happening here is that
normally umem tries to satisfy allocations from a set of caches that are
designed to handle high concurrency, and it also uses the per-thread
caches for smaller allocations. Allocations that can't be satisfied by
the caches due to their size are instead allocated through the vmem
interfaces you see popping up. Those do not have the same scalability
properties as the rest of libumem, which is a problem.

Now, the reason that it's much worse here is the implementation of
memalign() in libumem. What it does is check whether the requested
alignment is already satisfied by malloc's default alignment in umem
and, if not, fall back to using the vmem interfaces. That's unfortunate,
as it's pretty clear that in a lot of cases it'd be much better to just
use a larger allocation and align inside of it ourselves, much like I
suspect your own wrapper is doing.

I think improving memalign performance for the common cases that fit
within our normal umem cache sizes would be helpful. I expect most folks
calling memalign are going for cache-line- or page-sized alignments. As
our upper bound on caches is 128k, I expect that'd get us out of the way
for most callers and would at least be a straightforward and simple
improvement. For larger allocations, we'll likely need to figure out how
to break up the vmem backend so it can be parallelized across CPUs.

Robert
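
A rough sketch of the shape of that suggested fast path, under the assumptions
above (all names and constants here, such as sketch_memalign, SKETCH_CACHE_MAX,
and SKETCH_MALLOC_ALIGN, are placeholders rather than libumem's actual
identifiers):

    #include <stdlib.h>
    #include <stdint.h>

    /*
     * Rough sketch of the suggested fast path: requests whose size plus
     * alignment padding still fit under the largest cache size (128k in
     * the discussion above) are satisfied by over-allocating through the
     * ordinary malloc()/cache path and aligning within that buffer, rather
     * than falling straight through to the vmem interfaces.  align is
     * assumed to be a power of two.
     */
    #define SKETCH_CACHE_MAX    (128 * 1024)  /* assumed cache upper bound */
    #define SKETCH_MALLOC_ALIGN 16            /* assumed default malloc alignment */

    void *
    sketch_memalign(size_t align, size_t size)
    {
        /* The default malloc alignment already satisfies the request. */
        if (align <= SKETCH_MALLOC_ALIGN)
            return (malloc(size));

        /* Small enough: over-allocate from the caches and align inside. */
        if (size + align + sizeof (void *) <= SKETCH_CACHE_MAX) {
            void *raw = malloc(size + align + sizeof (void *));
            if (raw == NULL)
                return (NULL);
            uintptr_t a = ((uintptr_t)raw + sizeof (void *) + align - 1) &
                ~(uintptr_t)(align - 1);
            /*
             * A real implementation would have to teach free() to recognize
             * this layout (or record it in malloc's existing per-buffer
             * bookkeeping); here we just stash the original pointer the same
             * way the wrapper above does.
             */
            ((void **)a)[-1] = raw;
            return ((void *)a);
        }

        /* Large requests would still take the existing vmem-backed path. */
        return (NULL);  /* placeholder for that slow path */
    }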
