On Wednesday, December 8, 2021 22:07:12 CET, Tomas Vondra wrote:
> On 12/8/21 16:51, Ronan Dunklau wrote:
> > On Thursday, September 9, 2021 15:37:59 CET, Tomas Vondra wrote:
> >> And now comes the funny part - if I run it in the same backend as the
> >> "full" benchmark, I get roughly the same results:
> >>
> >>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> >> ------------+------------+---------------+----------+---------
> >>       32768 |        512 |     806256640 |    37159 |   76669
> >>
> >> but if I reconnect and run it in the new backend, I get this:
> >>
> >>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> >> ------------+------------+---------------+----------+---------
> >>       32768 |        512 |     806158336 |   233909 |  100785
> >> (1 row)
> >>
> >> It does not matter if I wait a bit before running the query, if I run it
> >> repeatedly, etc. The machine is not doing anything else, the CPU is set
> >> to use the "performance" governor, etc.
> >
> > I've reproduced the behaviour you mention. I also noticed
> > asm_exc_page_fault showing up in the perf report in that case.
> >
> > Running strace on it shows that in one case we have a lot of brk calls,
> > while when we run in the same process as the previous tests, we don't.
> >
> > My suspicion is that the previous workload makes glibc malloc change its
> > trim_threshold and possibly other dynamic options, which leads to
> > constantly moving the brk pointer in one case and not in the other.
> >
> > Running your fifo test with absurd malloc options shows that this might
> > indeed be the case (I needed to change several of them, because changing
> > any one disables the dynamic adjustment for every single one of them,
> > and malloc would fall back to using mmap and freeing it on each
> > iteration):
> >
> > mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
> > mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
> > mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long));
> >
> > I get the following results for your self-contained test. I ran the
> > query twice in each case, to show the difference between the first
> > allocation and the subsequent ones.
> >
> > With default malloc options:
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   300156 |  207557
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   211942 |   77207
> >
> > With the oversized values above:
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   219000 |   36223
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |    75761 |   78082
> > (1 row)
> >
> > I can't tell how representative your benchmark extension would be of
> > real-life allocation / free patterns, but there is probably something
> > we can improve here.
>
> Thanks for looking at this. I think those allocation / free patterns are
> fairly extreme, and there probably are no workloads doing exactly this.
> The idea is the actual workloads are likely some combination of these
> extreme cases.
>
> > I'll try to see if I can understand more precisely what is happening.
>
> Thanks, that'd be helpful. Maybe we can learn something about tuning
> malloc parameters to get significantly better performance.
>
Apologies for the long email; maybe what I will state here is obvious to others, but I learnt a lot, so...

I think I understood what the problem was in your generation tests: depending on the sequence of allocations, we end up with different maximums for mmap_threshold and trim_threshold.

When an mmap'ed block is freed, malloc raises its mmap_threshold, as it considers memory freed at that size to be better served by moving the sbrk pointer than by mmap'ing and munmap'ing on every cycle. The threshold is raised (to the size of the freed block) whenever we free an mmap'ed block bigger than the current threshold. At the same time, trim_threshold is raised to twice the new mmap_threshold, on the assumption that this amount of memory should not be released back to the OS because we have a good chance of reusing it.

This can be demonstrated using the attached systemtap script, along with a patch adding new probes to the generation context for this purpose.

When running your query:

select block_size, chunk_size, x.*
from (values (512)) AS a(chunk_size),
     (values (32768)) AS b(block_size),
     lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                   2*chunk_size, 100, 10000, 5000) x;

we obtain the following trace of threshold adjustments:

2167837: New thresholds: mmap: 135168 bytes, trim: 270336 bytes
2167837: New thresholds: mmap: 266240 bytes, trim: 532480 bytes
2167837: New thresholds: mmap: 528384 bytes, trim: 1056768 bytes
2167837: New thresholds: mmap: 1052672 bytes, trim: 2105344 bytes
2167837: New thresholds: mmap: 16003072 bytes, trim: 32006144 bytes

When running the full benchmark, we reach a higher threshold at some point:

2181301: New thresholds: mmap: 135168 bytes, trim: 270336 bytes
2181301: New thresholds: mmap: 266240 bytes, trim: 532480 bytes
2181301: New thresholds: mmap: 528384 bytes, trim: 1056768 bytes
2181301: New thresholds: mmap: 1052672 bytes, trim: 2105344 bytes
2181301: New thresholds: mmap: 16003072 bytes, trim: 32006144 bytes
2181301: New thresholds: mmap: 24002560 bytes, trim: 48005120 bytes

This is because, at some point in the full benchmark, we allocate a block bigger than the mmap threshold, which malloc serves with an mmap; we then free it, which raises both mmap_threshold and trim_threshold. The subsequent allocations in the lone query fall between those two thresholds, so they are served by moving the sbrk pointer and then releasing the memory back to the OS, which turns out to be expensive too.

One thing I observed is that this effect of constantly moving the sbrk pointer still happens in a real tuplesort workload, just not as much. I haven't tested the reorder buffer yet, which is the other part of the code using the generation context.

So I decided to benchmark while controlling the malloc thresholds. These results are from my local laptop (an i7 processor), for the benchmark initially proposed by David, with 4GB work_mem and default settings. I benchmarked master, the original patch with a fixed block size, and the one adjusting the size. I ran each of them with default malloc options (i.e. with the thresholds adjusted dynamically), and with MALLOC_MMAP_THRESHOLD_ set to 32MB (the maximum on my platform, 4 * 1024 * 1024 * sizeof(long)) and MALLOC_TRIM_THRESHOLD_ set to the value malloc itself would have chosen after raising the mmap threshold that far (i.e. 64MB). I did that by setting the corresponding environment variables.

The results are in the attached spreadsheet. I will follow up with a benchmark of the test sorting a table with a width varying from 1 to 32 columns.

As of now, my conclusion is that for glibc malloc the block size we use doesn't really matter much, as long as we tell malloc not to release a certain amount of memory back to the system once it has been allocated. Setting mmap_threshold to min(work_mem, DEFAULT_MMAP_THRESHOLD_MAX) and trim_threshold to twice that would IMO take us to where a long-lived backend would likely end up anyway: as soon as we allocate min(work_mem, 32MB), we don't give it back to the system, saving us a huge number of syscalls in common cases.
Doing that would not change the allocation profile on other platforms, and should be safe for those.

The only problem I see with this approach concerns allocations in excess of work_mem: they would no longer trigger a threshold raise, since once we are in "non-dynamic" mode, glibc's malloc keeps our manually set values. I guess this proposal could be refined to adjust the thresholds dynamically ourselves when pfreeing blocks, just like malloc does.

What do you think of this analysis and idea?

--
Ronan Dunklau
commit eb883bc52d6938d965870714e513ec00d18b05b0
Author: Ronan Dunklau <ronan.dunk...@aiven.io>
Date:   Fri Dec 10 15:51:54 2021 +0100

    Add tracing to generation context

diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c
index 2c98877953..b61a9c024d 100644
--- a/src/backend/utils/mmgr/generation.c
+++ b/src/backend/utils/mmgr/generation.c
@@ -41,6 +41,7 @@
 #include "lib/ilist.h"
 #include "utils/memdebug.h"
 #include "utils/memutils.h"
+#include "pg_trace.h"
 
 #define Generation_BLOCKHDRSZ	MAXALIGN(sizeof(GenerationBlock))
@@ -278,7 +279,7 @@ GenerationContextCreate(MemoryContext parent,
 										&GenerationMethods,
 										parent,
 										name);
-
+	TRACE_POSTGRESQL_GENERATION_ALLOC_CREATE();
 	return (MemoryContext) set;
 }
@@ -313,8 +314,9 @@ GenerationReset(MemoryContext context)
 #ifdef CLOBBER_FREED_MEMORY
 		wipe_mem(block, block->blksize);
 #endif
-
+		TRACE_POSTGRESQL_GENERATION_ALLOC_FREE(block->blksize);
 		free(block);
+
 	}
 
 	set->block = NULL;
@@ -335,6 +337,7 @@ GenerationDelete(MemoryContext context)
 {
 	/* Reset to release all the GenerationBlocks */
 	GenerationReset(context);
+	TRACE_POSTGRESQL_GENERATION_ALLOC_DESTROY();
 	/* And free the context header */
 	free(context);
 }
@@ -458,6 +461,7 @@ GenerationAlloc(MemoryContext context, Size size)
 		blksize <<= 1;
 
 	block = (GenerationBlock *) malloc(blksize);
+	TRACE_POSTGRESQL_GENERATION_ALLOC_MALLOC(blksize);
 	if (block == NULL)
 		return NULL;
@@ -596,6 +600,7 @@ GenerationFree(MemoryContext context, void *pointer)
 	dlist_delete(&block->node);
 	context->mem_allocated -= block->blksize;
 
+	TRACE_POSTGRESQL_GENERATION_ALLOC_FREE(block->blksize);
 	free(block);
 }
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index b0c50a3c7f..0531ff03c8 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -91,4 +91,10 @@ provider postgresql {
 	probe wal__switch();
 	probe wal__buffer__write__dirty__start();
 	probe wal__buffer__write__dirty__done();
+
+	probe generation_alloc_malloc(Size);
+	probe generation_alloc_free(Size);
+
+	probe generation_alloc_create();
+	probe generation_alloc_destroy();
 };
#! /usr/bin/env stap
# http://developerblog.redhat.com/2015/01/06/malloc-systemtap-probes-an-example/

global sbrk, waits, arenalist, mmap_threshold = 131072, heaplist
global trim_threshold = 0
global context_creation_sbrk = 0
global cnt_sbrk_more = 0
global cnt_sbrk_less = 0
global palloced_mem = 0
global pfreed_mem = 0
global sum_sbrk_more = 0
global sum_sbrk_less = 0

# sbrk accounting
probe process("/lib*/libc.so.6").mark("memory_sbrk_more")
{
	sbrk += $arg2
	cnt_sbrk_more += 1
	sum_sbrk_more += $arg2
	if ($arg2 < 0) {
		printf("sbrk_more called with %ld", $arg2)
	}
}

probe process("/lib*/libc.so.6").mark("memory_sbrk_less")
{
	sbrk -= $arg2
	cnt_sbrk_less += 1
	sum_sbrk_less += $arg2
}

# threshold tracking
probe process("/lib*/libc.so.6").mark("memory_mallopt_free_dyn_thresholds")
{
	if ($arg1 != mmap_threshold || $arg2 != trim_threshold) {
		printf("%d: New thresholds: mmap: %ld bytes, trim: %ld bytes\n",
		       tid(), $arg1, $arg2)
	}
	mmap_threshold = $arg1
	trim_threshold = $arg2
}

probe process("/home/ro/dev/postgresql-install/bin/postgres").mark("generation_alloc_malloc")
{
	palloced_mem += $arg1
}

probe process("/home/ro/dev/postgresql-install/bin/postgres").mark("generation_alloc_free")
{
	pfreed_mem += $arg1
}

probe process("/home/ro/dev/postgresql-install/bin/postgres").mark("generation_alloc_create")
{
	palloced_mem = 0;
	pfreed_mem = 0;
	cnt_sbrk_more = 0;
	cnt_sbrk_less = 0;
	sum_sbrk_more = 0;
	context_creation_sbrk = sbrk;
}

probe process("/home/ro/dev/postgresql-install/bin/postgres").mark("generation_alloc_destroy")
{
	printf("Palloced mem in generation: %ld\n", palloced_mem);
	printf("Pfreed mem in generation: %ld\n", pfreed_mem);
	printf("Sbrk moved: %ld\n", sbrk - context_creation_sbrk);
	printf("Sbrk calls: More: %d Less: %d\n", cnt_sbrk_more, cnt_sbrk_less);
	printf("Sbrk bytes more: %d\n", sum_sbrk_more);
}

# reporting
probe begin
{
	if (target() == 0)
		error("please specify target process with -c / -x")
}
bench_results_5tests.ods
Description: application/vnd.oasis.opendocument.spreadsheet