On Mon, 18 Mar 2019, Robert Mustacchi wrote:
> Hi Bob,
>
> Thanks for digging into this. It's useful to have other takes on this. I
> see these all as reasons that we should look at improving libumem. So a
> few notes:
>
> 1) libumem isn't the default allocator, but a number of things link
> against it so it can easily end up being pulled in. The default libc
> allocator is even worse in a multi-threaded environment.
>
> 2) Right now when the alignment is a bit larger via
> memalign/posix_memalign, we end up bypassing the traditional umem
> caches, which can make that end up performing poorer.

My objective with the buffer alignment request is to avoid cache-line
thrashing, and also to provide an opportunity for SSE2-type code to
work. The allocation size is also rounded up to the cache line size.
Linux malloc already appears to provide the desired alignment by
default, but Solaris malloc has been observed to be more space
efficient, leading to unexpected and unpredictable cache-line
thrashing.

I do not need to use posix_memalign() since I have a good work-around
based on malloc()/free().
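
For readers unfamiliar with this kind of work-around, the general
technique can be sketched as follows. This is a generic illustration
and not the GraphicsMagick code: the names aligned_malloc,
aligned_free and round_up_to_cache_line are hypothetical, and the
64-byte cache line size is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Round size up to a whole number of cache lines. */
size_t round_up_to_cache_line(size_t size)
{
    return (size + CACHE_LINE - 1) & ~((size_t)CACHE_LINE - 1);
}

/*
 * Allocate `size` bytes aligned to `alignment` using only malloc().
 * `alignment` must be a power of two and at least sizeof(void *).
 * The pointer returned by malloc() is saved just below the aligned
 * address so that aligned_free() can recover it.
 */
void *aligned_malloc(size_t alignment, size_t size)
{
    assert(alignment >= sizeof(void *));
    assert((alignment & (alignment - 1)) == 0);

    /* Crude overflow guard, good enough for a sketch. */
    if (alignment > SIZE_MAX / 4 || size > SIZE_MAX / 4)
        return NULL;

    size_t padded = round_up_to_cache_line(size);

    /* Payload + worst-case alignment slack + room for the saved pointer. */
    void *raw = malloc(padded + alignment - 1 + sizeof(void *));
    if (raw == NULL)
        return NULL;

    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + alignment - 1) & ~((uintptr_t)alignment - 1);

    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

/* Release a block obtained from aligned_malloc(). */
void aligned_free(void *ptr)
{
    if (ptr != NULL)
        free(((void **)ptr)[-1]);
}
```

Because the size is padded to a whole number of cache lines and the
start is cache-line aligned, no two such buffers share a cache line,
regardless of how tightly the underlying malloc() packs its blocks.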

> 3) Based on your analysis of the different algorithms in use, would it
> be possible to synthesize a bit more of a microbenchmark that describes
> the various allocation and free patterns that are going on? That might
> help us understand that a little bit better and see where we can
> generally improve things.

I am still working to understand the issue myself. There are really
very few allocations going on over a ten second run, but sometimes
there could be 40 allocation requests (e.g. on a 40-thread system)
arriving at the same time, each from a different thread.
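
As a possible starting point for such a microbenchmark, the pattern
described above (many threads each issuing one aligned allocation at
the same instant) might be sketched as below. This is only a sketch
under assumptions, not code taken from GraphicsMagick: run_burst,
worker and struct worker_arg are hypothetical names, the 64-byte
alignment is assumed, and the timing includes thread start-up.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

#define MAX_THREADS 64   /* assumed upper bound for this sketch */
#define ALIGNMENT   64   /* assumed cache line size in bytes */

struct worker_arg {
    pthread_barrier_t *barrier;
    size_t size;     /* bytes per allocation */
    int rounds;      /* number of simultaneous bursts */
    long ok;         /* successful allocations in this thread */
};

static void *worker(void *p)
{
    struct worker_arg *arg = p;

    for (int i = 0; i < arg->rounds; i++) {
        void *buf = NULL;

        /* Hold every thread here so the requests arrive together. */
        pthread_barrier_wait(arg->barrier);
        if (posix_memalign(&buf, ALIGNMENT, arg->size) == 0) {
            ((volatile char *)buf)[0] = 1;   /* touch the buffer */
            arg->ok++;
            free(buf);
        }
    }
    return NULL;
}

/*
 * Run `rounds` bursts of `nthreads` simultaneous aligned allocations.
 * Returns the number of successful allocations, or -1 on a setup
 * error. On success the elapsed wall-clock time in nanoseconds
 * (including thread start-up) is stored in *elapsed_ns.
 */
long run_burst(int nthreads, int rounds, size_t size, long long *elapsed_ns)
{
    pthread_t tids[MAX_THREADS];
    struct worker_arg args[MAX_THREADS];
    pthread_barrier_t barrier;
    struct timespec t0, t1;
    long total = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS || rounds < 0)
        return -1;
    if (pthread_barrier_init(&barrier, NULL, (unsigned)nthreads) != 0)
        return -1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++) {
        args[i] = (struct worker_arg){ &barrier, size, rounds, 0 };
        /* Sketch only: a failed create would leave the others stuck. */
        if (pthread_create(&tids[i], NULL, worker, &args[i]) != 0)
            abort();
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
        total += args[i].ok;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_barrier_destroy(&barrier);

    *elapsed_ns = (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
        + (t1.tv_nsec - t0.tv_nsec);
    return total;
}
```

Something like run_burst(40, 100, 65536, &ns) would roughly
approximate the 40-thread case; the buffer size and round count are
placeholders to be tuned against the real workload.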

I used a tool called 'plockstat' to look at locking, and in the
problem cases libumem is taking (by far) most of the time. Here are
the top entries, continuing until the output reaches a lock in my own
application, a lock in the dynamic loader, and a lock used by GCC's
gomp:

Mutex block

-------------------------------------------------------------------------------
Count     nsec Lock                         Caller
 1906 14758238 0xa46030                     libumem.so.1`vmem_xalloc+0xfc

      nsec ---- Time Distribution --- count Stack
      4096 |@                       |   121 libc.so.1`mutex_lock_impl+0x189
      8192 |@@                      |   182 libc.so.1`mutex_lock+0x13
     16384 |                        |     9 libumem.so.1`vmem_xalloc+0xfc
     32768 |@                       |   118 libumem.so.1`memalign+0xb0
     65536 |@                       |    88 libc.so.1`posix_memalign+0x41
    131072 |                        |    59 gm`MagickMallocAligned+0x38
    262144 |@                       |    96 gm`AllocateCacheNexus+0x13
    524288 |@                       |   123 gm`AcquireCacheNexus+0x137
   1048576 |@@                      |   168 gm`AcquireCacheViewPixels+0x6c
   2097152 |@@                      |   195 gm`InterpolateViewColor+0x42
   4194304 |@                       |   143 gm`WaveImage._omp_fn.4+0x166
   8388608 |@                       |   151
  16777216 |@@                      |   162
  33554432 |@                       |   107
  67108864 |@                       |   109
 134217728 |                        |    61
 268435456 |                        |    14
-------------------------------------------------------------------------------
Count     nsec Lock                         Caller
 1587 16720081 0xa46030                     libumem.so.1`vmem_xfree+0x3e

      nsec ---- Time Distribution --- count Stack
      2048 |                        |     2 libc.so.1`mutex_lock_impl+0x189
      4096 |@                       |    84 libc.so.1`mutex_lock+0x13
      8192 |@@                      |   155 libumem.so.1`vmem_xfree+0x3e
     16384 |                        |    11 libumem.so.1`process_free+0x122
     32768 |@                       |   104 libumem.so.1`umem_malloc_free+0x1d
     65536 |@                       |    83 gm`AcquireCacheNexus+0x2fa
    131072 |                        |    63 gm`AcquireCacheViewPixels+0x6c
    262144 |@                       |    70 gm`InterpolateViewColor+0x42
    524288 |@                       |    86 gm`WaveImage._omp_fn.4+0x166
   1048576 |@@                      |   153 libgomp.so.1.0.0`gomp_thread_start+0x18d
   2097152 |@@                      |   141 libc.so.1`_thrp_setup+0x8a
   4194304 |@                       |   118
   8388608 |@                       |   122
  16777216 |@                       |   131
  33554432 |@                       |    96
  67108864 |@                       |    90
 134217728 |                        |    63
 268435456 |                        |    13
 536870912 |                        |     2
-------------------------------------------------------------------------------
Count     nsec Lock                         Caller
   98 13914404 0xa46030                     libumem.so.1`vmem_xfree+0x3e

      nsec ---- Time Distribution --- count Stack
      4096 |@                       |     5 libc.so.1`mutex_lock_impl+0x189
      8192 |@@@                     |    14 libc.so.1`mutex_lock+0x13
     16384 |                        |     2 libumem.so.1`vmem_xfree+0x3e
     32768 |@@                      |     9 libumem.so.1`process_free+0x122
     65536 |@                       |     6 libumem.so.1`umem_malloc_free+0x1d
    131072 |                        |     3 gm`AcquireCacheNexus+0x2fa
    262144 |                        |     3 gm`AcquireCacheViewPixels+0x6c
    524288 |                        |     3 gm`InterpolateViewColor+0x42
   1048576 |@@                      |     9 gm`WaveImage._omp_fn.4+0x166
   2097152 |@                       |     6 libgomp.so.1.0.0`GOMP_parallel+0x40
   4194304 |@                       |     7 gm`WaveImage+0x185
   8388608 |@                       |     8
  16777216 |@                       |     6
  33554432 |@                       |     6
  67108864 |@                       |     8
 134217728 |                        |     3
-------------------------------------------------------------------------------
Count     nsec Lock                         Caller
    9  3881187 0xa46030                     libumem.so.1`vmem_xalloc+0xfc

      nsec ---- Time Distribution --- count Stack
     16384 |@@@@@                   |     2 libc.so.1`mutex_lock_impl+0x189
     32768 |@@                      |     1 libc.so.1`mutex_lock+0x13
     65536 |@@@@@                   |     2 libumem.so.1`vmem_xalloc+0xfc
    131072 |@@                      |     1 libumem.so.1`memalign+0xb0
    262144 |                        |     0 libc.so.1`posix_memalign+0x41
    524288 |                        |     0 gm`MagickMallocAligned+0x38
   1048576 |@@                      |     1 gm`SetNexus+0x55d
   2097152 |                        |     0 gm`AcquireCacheNexus+0xd1
   4194304 |                        |     0 gm`AcquireCacheViewPixels+0x6c
   8388608 |                        |     0 gm`InterpolateViewColor+0x42
  16777216 |@@@@@                   |     2 gm`WaveImage._omp_fn.4+0x166
-------------------------------------------------------------------------------
Count     nsec Lock                         Caller
    9   430535 libc.so.1`_uberdata+0x2a20   libc.so.1`_lwp_start

      nsec ---- Time Distribution --- count Stack
      8192 |@@                      |     1 libc.so.1`lmutex_lock+0xf8
     16384 |                        |     0 libc.so.1`tls_setup+0x72
     32768 |@@@@@                   |     2 libc.so.1`_thrp_setup+0x55
     65536 |                        |     0 libc.so.1`_lwp_start
    131072 |@@                      |     1
    262144 |                        |     0
    524288 |@@@@@@@@                |     3
   1048576 |@@@@@                   |     2
-------------------------------------------------------------------------------
Count     nsec Lock                         Caller
    4   296960 0xb351c0                     gm`LockSemaphoreInfo+0x3d

      nsec ---- Time Distribution --- count Stack
      8192 |@@@@@@                  |     1 libc.so.1`mutex_lock_impl+0x189
     16384 |                        |     0 libc.so.1`mutex_lock+0x13
     32768 |                        |     0 gm`LockSemaphoreInfo+0x3d
     65536 |                        |     0 gm`ModifyCache+0x56
    131072 |@@@@@@                  |     1 gm`SetCacheNexus+0x5c
    262144 |                        |     0 gm`SetCacheViewPixels+0x6c
    524288 |@@@@@@@@@@@@            |     2 gm`ConstituteTextureImage._omp_fn.0+0xbf
                                            libgomp.so.1.0.0`gomp_thread_start+0x18d
                                            libc.so.1`_thrp_setup+0x8a
                                            libc.so.1`_lwp_start
-------------------------------------------------------------------------------
Count     nsec Lock                         Caller
    1  1048576 0xa46030                     libumem.so.1`vmem_xalloc+0x41d

      nsec ---- Time Distribution --- count Stack
   1048576 |@@@@@@@@@@@@@@@@@@@@@@@@|     1 libc.so.1`mutex_lock_impl+0x189
                                            libc.so.1`mutex_lock+0x13
                                            libumem.so.1`vmem_xalloc+0x41d
                                            libumem.so.1`memalign+0xb0
                                            libc.so.1`posix_memalign+0x41
                                            gm`MagickMallocAligned+0x38
                                            gm`SetNexus+0x55d
                                            gm`AcquireCacheNexus+0xd1
                                            gm`AcquireCacheViewPixels+0x6c
                                            gm`InterpolateViewColor+0x42
                                            gm`WaveImage._omp_fn.4+0x166
-------------------------------------------------------------------------------
Count     nsec Lock                         Caller
    4   114688 libc.so.1`_uberdata+0x2a20   libgomp.so.1.0.0`gomp_thread_start+0x24

      nsec ---- Time Distribution --- count Stack
     32768 |@@@@@@@@@@@@            |     2 libc.so.1`lmutex_lock+0xf8
     65536 |                        |     0 libc.so.1`slow_tls_get_addr+0x49
    131072 |@@@@@@                  |     1 libgomp.so.1.0.0`gomp_thread_start+0x24
    262144 |@@@@@@                  |     1 libc.so.1`_thrp_setup+0x8a
                                            libc.so.1`_lwp_start

--
Bob Friesenhahn
[email protected], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Public Key, http://www.simplesystems.org/users/bfriesen/public-key.txt
------------------------------------------
illumos: illumos-discuss
Permalink:
https://illumos.topicbox.com/groups/discuss/T30dd2eceb8a069b3-M4bee0e5296e27efb49ba1aaf