On 3/18/19 12:48, Bob Friesenhahn wrote:
> On Mon, 18 Mar 2019, Bob Friesenhahn wrote:
>
>> Based on my understanding of plockstat output (reported times are
>> described as "Average duration of an event, in nanoseconds"), libumem
>> is taking 28.6, 47.8, or 67.1 *milliseconds* per contested lock for
>> memory requests related to posix_memalign(). This is an astonishing
>> amount of time.
>
> If I remove use of posix_memalign() and instead use an aligned memory
> implementation based on malloc() with an added header used by a special
> matching freeing function, then the longest umem-related time I see is
> vmem_sbrk_alloc, which takes 8,052,736 nanoseconds or sometimes
> 23,555,227 nanoseconds. Otherwise, umem is largely gone from the
> radar. Contention issues due to design issues in my own code then
> become the prominent ones.
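
The wrapper Bob describes is presumably something along these lines: over-allocate
with plain malloc(), round the result up to the requested alignment, and keep the
original pointer in a small header so a matching free routine can find it again.
A minimal sketch of that approach (the names aligned_malloc/aligned_free and the
exact layout are illustrative, not his actual code):

    #include <stdlib.h>
    #include <stdint.h>

    /*
     * Over-allocate with malloc(), round the result up to the requested
     * alignment (assumed to be a power of two), and stash the original
     * malloc() pointer in a header just below the aligned address so
     * aligned_free() can recover it.
     */
    void *
    aligned_malloc(size_t align, size_t size)
    {
        void *raw = malloc(size + align + sizeof (void *));
        if (raw == NULL)
            return (NULL);

        uintptr_t aligned = ((uintptr_t)raw + sizeof (void *) + align - 1) &
            ~(uintptr_t)(align - 1);
        ((void **)aligned)[-1] = raw;   /* header: the pointer malloc gave us */
        return ((void *)aligned);
    }

    void
    aligned_free(void *ptr)
    {
        if (ptr != NULL)
            free(((void **)ptr)[-1]);
    }

Because the underlying request goes through malloc(), small allocations stay on
umem's cached (and per-thread cached) path instead of dropping into the vmem
interfaces discussed below.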
Hi Bob,

Thanks for the additional detail here. So, what's happening here is that
normally umem tries to satisfy allocations from a set of caches that are
designed to handle high concurrency, and it also uses the per-thread
caches for smaller allocations. Allocations that can't be satisfied by
the caches due to their size are instead allocated through the vmem
interfaces you see popping up. Those do not have the same scalability
properties as the rest of libumem, which is a problem.

Now, the reason that it's much worse here is the implementation of
memalign() in libumem. What it does is check whether the requested
alignment is already satisfied by malloc's default alignment in umem
and, if not, fall back to using the vmem interfaces. That's unfortunate,
as it's pretty clear that in a lot of cases it'd be much better to just
use a larger allocation and align inside of it ourselves, much like I
suspect your own wrapper is doing.

I think improving memalign performance for the common cases that fit
within our normal umem cache sizes would be helpful. I expect most folks
calling memalign are going for cache-line- or page-sized alignments. As
our upper bound on caches is 128k, I expect that'd get us out of the way
for most callers and would at least be a straightforward and simple
improvement. For larger allocations, we'll likely need to figure out how
to break up the vmem backend so it can be parallelized across CPUs.

Robert
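
A rough sketch of the shape of that suggested fast path, under the assumptions
above (all names and constants here, such as sketch_memalign, SKETCH_CACHE_MAX,
and SKETCH_MALLOC_ALIGN, are placeholders rather than libumem's actual
identifiers):

    #include <stdlib.h>
    #include <stdint.h>

    /*
     * Rough sketch of the suggested fast path: requests whose size plus
     * alignment padding still fit under the largest cache size (128k in
     * the discussion above) are satisfied by over-allocating through the
     * ordinary malloc()/cache path and aligning within that buffer, rather
     * than falling straight through to the vmem interfaces.  align is
     * assumed to be a power of two.
     */
    #define SKETCH_CACHE_MAX    (128 * 1024)  /* assumed cache upper bound */
    #define SKETCH_MALLOC_ALIGN 16            /* assumed default malloc alignment */

    void *
    sketch_memalign(size_t align, size_t size)
    {
        /* The default malloc alignment already satisfies the request. */
        if (align <= SKETCH_MALLOC_ALIGN)
            return (malloc(size));

        /* Small enough: over-allocate from the caches and align inside. */
        if (size + align + sizeof (void *) <= SKETCH_CACHE_MAX) {
            void *raw = malloc(size + align + sizeof (void *));
            if (raw == NULL)
                return (NULL);
            uintptr_t a = ((uintptr_t)raw + sizeof (void *) + align - 1) &
                ~(uintptr_t)(align - 1);
            /*
             * A real implementation would have to teach free() to recognize
             * this layout (or record it in malloc's existing per-buffer
             * bookkeeping); here we just stash the original pointer the same
             * way the wrapper above does.
             */
            ((void **)a)[-1] = raw;
            return ((void *)a);
        }

        /* Large requests would still take the existing vmem-backed path. */
        return (NULL);  /* placeholder for that slow path */
    }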
