On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote:
> 0002:
> This modifies the tuplesort API so that instead of having a
> randomAccess bool flag,
Per cirrus, this missed updating one line for dtrace.

On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote:
> I feel this is a pretty simple patch and if nobody has any objections
> then I plan to push all 3 of them on my New Zealand Monday morning.

I'll mark the CF as such.

-- 
Justin
>From a6271ac626b561cdafe60152eb4fb208e7fe7240 Mon Sep 17 00:00:00 2001 From: David Rowley <dgrow...@gmail.com> Date: Wed, 16 Feb 2022 10:23:32 +1300 Subject: [PATCH 1/4] Improve the generation memory allocator Here we make a series of improvements to the generation memory allocator: 1. Allow generation contexts to have a minimum, initial and maximum block sizes. The standard allocator allows this already but when the generation context was added, it only allowed fixed-sized blocks. The problem with fixed-sized blocks is that it's difficult to choose how large to make the blocks. If the chosen size is too small then we'd end up with a large number of blocks and a large number of malloc calls. If the block size is made too large, then memory is wasted. 2. Add support for "keeper" blocks. This is a special block that is allocated along with the context itself but is never freed. Instead, when the last chunk in the keeper block is freed, we simply mark the block as empty to allow new allocations to make use of it. 3. Add facility to "recycle" newly empty blocks instead of freeing them and having to later malloc an entire new block again. We do this by recording a single GenerationBlock which has become empty of any chunks. When we run out of space in the current block, we check to see if there is a "freeblock" and use that if it contains enough space for the allocation. Author: David Rowley, Tomas Vondra Discussion: https://postgr.es/m/d987fd54-01f8-0f73-af6c-519f799a0...@enterprisedb.com --- src/backend/access/gist/gistvacuum.c | 6 + .../replication/logical/reorderbuffer.c | 7 + src/backend/utils/mmgr/generation.c | 383 ++++++++++++++---- src/include/utils/memutils.h | 4 +- 4 files changed, 322 insertions(+), 78 deletions(-) diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c index aac4afab8ff..f190decdff2 100644 --- a/src/backend/access/gist/gistvacuum.c +++ b/src/backend/access/gist/gistvacuum.c @@ -158,9 +158,15 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, * pages in page_set_context. Internally, the integer set will remember * this context so that the subsequent allocations for these integer sets * will be done from the same context. + * + * XXX the allocation sizes used below pre-date generation context's block + * growing code. These values should likely be benchmarked and set to + * more suitable values. */ vstate.page_set_context = GenerationContextCreate(CurrentMemoryContext, "GiST VACUUM page set context", + 16 * 1024, + 16 * 1024, 16 * 1024); oldctx = MemoryContextSwitchTo(vstate.page_set_context); vstate.internal_page_set = intset_create(); diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index c2d9be81fae..4702750a2e7 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -370,8 +370,15 @@ ReorderBufferAllocate(void) SLAB_DEFAULT_BLOCK_SIZE, sizeof(ReorderBufferTXN)); + /* + * XXX the allocation sizes used below pre-date generation context's block + * growing code. These values should likely be benchmarked and set to + * more suitable values. 
+ */ buffer->tup_context = GenerationContextCreate(new_ctx, "Tuples", + SLAB_LARGE_BLOCK_SIZE, + SLAB_LARGE_BLOCK_SIZE, SLAB_LARGE_BLOCK_SIZE); hash_ctl.keysize = sizeof(TransactionId); diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c index 95c006f689f..29065201556 100644 --- a/src/backend/utils/mmgr/generation.c +++ b/src/backend/utils/mmgr/generation.c @@ -20,18 +20,15 @@ * * The memory context uses a very simple approach to free space management. * Instead of a complex global freelist, each block tracks a number - * of allocated and freed chunks. Freed chunks are not reused, and once all - * chunks in a block are freed, the whole block is thrown away. When the - * chunks allocated in the same block have similar lifespan, this works - * very well and is very cheap. + * of allocated and freed chunks. The block is classed as empty when the + * number of free chunks is equal to the number of allocated chunks. When + * this occurs, instead of freeing the block, we try to "recycle" it, i.e. + * reuse it for new allocations. This is done by setting the block in the + * context's 'freeblock' field. If the freeblock field is already occupied + * by another free block we simply return the newly empty block to malloc. * - * The current implementation only uses a fixed block size - maybe it should - * adapt a min/max block size range, and grow the blocks automatically. - * It already uses dedicated blocks for oversized chunks. - * - * XXX It might be possible to improve this by keeping a small freelist for - * only a small number of recent blocks, but it's not clear it's worth the - * additional complexity. + * This approach to free blocks requires fewer malloc/free calls for truely + * first allocated, first free'd allocation patterns. * *------------------------------------------------------------------------- */ @@ -39,6 +36,7 @@ #include "postgres.h" #include "lib/ilist.h" +#include "port/pg_bitutils.h" #include "utils/memdebug.h" #include "utils/memutils.h" @@ -46,6 +44,8 @@ #define Generation_BLOCKHDRSZ MAXALIGN(sizeof(GenerationBlock)) #define Generation_CHUNKHDRSZ sizeof(GenerationChunk) +#define Generation_CHUNK_FRACTION 8 + typedef struct GenerationBlock GenerationBlock; /* forward reference */ typedef struct GenerationChunk GenerationChunk; @@ -60,20 +60,28 @@ typedef struct GenerationContext MemoryContextData header; /* Standard memory-context fields */ /* Generational context parameters */ - Size blockSize; /* standard block size */ - - GenerationBlock *block; /* current (most recently allocated) block */ + Size initBlockSize; /* initial block size */ + Size maxBlockSize; /* maximum block size */ + Size nextBlockSize; /* next block size to allocate */ + Size allocChunkLimit; /* effective chunk size limit */ + + GenerationBlock *block; /* current (most recently allocated) block, or + * NULL if we've just freed the most recent + * block */ + GenerationBlock *freeblock; /* pointer to a block that's being recycled, + * or NULL if there's no such block. */ + GenerationBlock *keeper; /* keep this block over resets */ dlist_head blocks; /* list of blocks */ } GenerationContext; /* * GenerationBlock * GenerationBlock is the unit of memory that is obtained by generation.c - * from malloc(). It contains one or more GenerationChunks, which are + * from malloc(). It contains zero or more GenerationChunks, which are * the units requested by palloc() and freed by pfree(). 
GenerationChunks * cannot be returned to malloc() individually, instead pfree() * updates the free counter of the block and when all chunks in a block - * are free the whole block is returned to malloc(). + * are free the whole block can be returned to malloc(). * * GenerationBlock is the header data for a block --- the usable space * within the block begins at the next alignment boundary. @@ -143,6 +151,14 @@ struct GenerationChunk #define GenerationChunkGetPointer(chk) \ ((GenerationPointer *)(((char *)(chk)) + Generation_CHUNKHDRSZ)) +/* Inlined helper functions */ +static inline void GenerationBlockInit(GenerationBlock *block, Size blksize); +static inline bool GenerationBlockIsEmpty(GenerationBlock *block); +static inline void GenerationBlockMarkEmpty(GenerationBlock *block); +static inline Size GenerationBlockFreeBytes(GenerationBlock *block); +static inline void GenerationBlockFree(GenerationContext *set, + GenerationBlock *block); + /* * These functions implement the MemoryContext API for Generation contexts. */ @@ -191,14 +207,21 @@ static const MemoryContextMethods GenerationMethods = { * * parent: parent context, or NULL if top-level context * name: name of context (must be statically allocated) - * blockSize: generation block size + * minContextSize: minimum context size + * initBlockSize: initial allocation block size + * maxBlockSize: maximum allocation block size */ MemoryContext GenerationContextCreate(MemoryContext parent, const char *name, - Size blockSize) + Size minContextSize, + Size initBlockSize, + Size maxBlockSize) { + Size firstBlockSize; + Size allocSize; GenerationContext *set; + GenerationBlock *block; /* Assert we padded GenerationChunk properly */ StaticAssertStmt(Generation_CHUNKHDRSZ == MAXALIGN(Generation_CHUNKHDRSZ), @@ -208,24 +231,33 @@ GenerationContextCreate(MemoryContext parent, "padding calculation in GenerationChunk is wrong"); /* - * First, validate allocation parameters. (If we're going to throw an - * error, we should do so before the context is created, not after.) We - * somewhat arbitrarily enforce a minimum 1K block size, mostly because - * that's what AllocSet does. + * First, validate allocation parameters. Asserts seem sufficient because + * nobody varies their parameters at runtime. We somewhat arbitrarily + * enforce a minimum 1K block size. */ - if (blockSize != MAXALIGN(blockSize) || - blockSize < 1024 || - !AllocHugeSizeIsValid(blockSize)) - elog(ERROR, "invalid blockSize for memory context: %zu", - blockSize); + Assert(initBlockSize == MAXALIGN(initBlockSize) && + initBlockSize >= 1024); + Assert(maxBlockSize == MAXALIGN(maxBlockSize) && + maxBlockSize >= initBlockSize && + AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */ + Assert(minContextSize == 0 || + (minContextSize == MAXALIGN(minContextSize) && + minContextSize >= 1024 && + minContextSize <= maxBlockSize)); + + /* Determine size of initial block */ + allocSize = MAXALIGN(sizeof(GenerationContext)) + + Generation_BLOCKHDRSZ + Generation_CHUNKHDRSZ; + if (minContextSize != 0) + allocSize = Max(allocSize, minContextSize); + else + allocSize = Max(allocSize, initBlockSize); /* - * Allocate the context header. Unlike aset.c, we never try to combine - * this with the first regular block, since that would prevent us from - * freeing the first generation of allocations. + * Allocate the initial block. Unlike other generation.c blocks, it + * starts with the context header and its block header follows that. 
*/ - - set = (GenerationContext *) malloc(MAXALIGN(sizeof(GenerationContext))); + set = (GenerationContext *) malloc(allocSize); if (set == NULL) { MemoryContextStats(TopMemoryContext); @@ -240,11 +272,40 @@ GenerationContextCreate(MemoryContext parent, * Avoid writing code that can fail between here and MemoryContextCreate; * we'd leak the header if we ereport in this stretch. */ + dlist_init(&set->blocks); + + /* Fill in the initial block's block header */ + block = (GenerationBlock *) (((char *) set) + MAXALIGN(sizeof(GenerationContext))); + /* determine the block size and initialize it */ + firstBlockSize = allocSize - MAXALIGN(sizeof(GenerationContext)); + GenerationBlockInit(block, firstBlockSize); + + /* add it to the doubly-linked list of blocks */ + dlist_push_head(&set->blocks, &block->node); + + /* use it as the current allocation block */ + set->block = block; + + /* No free block, yet */ + set->freeblock = NULL; + + /* Mark block as not to be released at reset time */ + set->keeper = block; /* Fill in GenerationContext-specific header fields */ - set->blockSize = blockSize; - set->block = NULL; - dlist_init(&set->blocks); + set->initBlockSize = initBlockSize; + set->maxBlockSize = maxBlockSize; + set->nextBlockSize = initBlockSize; + + /* + * Compute the allocation chunk size limit for this context. + * + * Follows similar ideas as AllocSet, see aset.c for details ... + */ + set->allocChunkLimit = maxBlockSize; + while ((Size) (set->allocChunkLimit + Generation_CHUNKHDRSZ) > + (Size) ((Size) (maxBlockSize - Generation_BLOCKHDRSZ) / Generation_CHUNK_FRACTION)) + set->allocChunkLimit >>= 1; /* Finally, do the type-independent part of context creation */ MemoryContextCreate((MemoryContext) set, @@ -253,6 +314,8 @@ GenerationContextCreate(MemoryContext parent, parent, name); + ((MemoryContext) set)->mem_allocated = firstBlockSize; + return (MemoryContext) set; } @@ -276,24 +339,32 @@ GenerationReset(MemoryContext context) GenerationCheck(context); #endif + /* + * NULLify the free block pointer. We must do this before calling + * GenerationBlockFree as that function never expects to free the + * freeblock. 
+ */ + set->freeblock = NULL; + dlist_foreach_modify(miter, &set->blocks) { GenerationBlock *block = dlist_container(GenerationBlock, node, miter.cur); - dlist_delete(miter.cur); - - context->mem_allocated -= block->blksize; - -#ifdef CLOBBER_FREED_MEMORY - wipe_mem(block, block->blksize); -#endif - - free(block); + if (block == set->keeper) + GenerationBlockMarkEmpty(block); + else + GenerationBlockFree(set, block); } - set->block = NULL; + /* set it so new allocations to make use of the keeper block */ + set->block = set->keeper; - Assert(dlist_is_empty(&set->blocks)); + /* Reset block size allocation sequence, too */ + set->nextBlockSize = set->initBlockSize; + + /* Ensure there is only 1 item in the dlist */ + Assert(!dlist_is_empty(&set->blocks)); + Assert(!dlist_has_next(&set->blocks, dlist_head_node(&set->blocks))); } /* @@ -303,9 +374,9 @@ GenerationReset(MemoryContext context) static void GenerationDelete(MemoryContext context) { - /* Reset to release all the GenerationBlocks */ + /* Reset to release all releasable GenerationBlocks */ GenerationReset(context); - /* And free the context header */ + /* And free the context header and keeper block */ free(context); } @@ -329,11 +400,12 @@ GenerationAlloc(MemoryContext context, Size size) GenerationBlock *block; GenerationChunk *chunk; Size chunk_size = MAXALIGN(size); + Size required_size = chunk_size + Generation_CHUNKHDRSZ; /* is it an over-sized chunk? if yes, allocate special block */ - if (chunk_size > set->blockSize / 8) + if (chunk_size > set->allocChunkLimit) { - Size blksize = chunk_size + Generation_BLOCKHDRSZ + Generation_CHUNKHDRSZ; + Size blksize = required_size + Generation_BLOCKHDRSZ; block = (GenerationBlock *) malloc(blksize); if (block == NULL) @@ -379,36 +451,76 @@ GenerationAlloc(MemoryContext context, Size size) } /* - * Not an over-sized chunk. Is there enough space in the current block? If - * not, allocate a new "regular" block. + * Not an oversized chunk. We try to first make use of the current + * block, but if there's not enough space in it, instead of allocating a + * new block, we look to see if the freeblock is empty and has enough + * space. If not, we'll also try the same using the keeper block. The + * keeper block may have become empty and we have no other way to reuse it + * again if we don't try to use it explicitly here. + * + * We don't want to start filling the freeblock before the current block + * is full, otherwise we may cause fragmentation in FIFO type workloads. + * We only switch to using the freeblock or keeper block if those blocks + * are completely empty. If we didn't do that we could end up fragmenting + * consecutive allocations over multiple blocks which would be a problem + * that would compound over time. 
*/ block = set->block; - if ((block == NULL) || - (block->endptr - block->freeptr) < Generation_CHUNKHDRSZ + chunk_size) + if (block == NULL || + GenerationBlockFreeBytes(block) < required_size) { - Size blksize = set->blockSize; + Size blksize; + GenerationBlock *freeblock = set->freeblock; - block = (GenerationBlock *) malloc(blksize); + if (freeblock != NULL && + GenerationBlockIsEmpty(freeblock) && + GenerationBlockFreeBytes(freeblock) >= required_size) + { + block = freeblock; - if (block == NULL) - return NULL; + /* Zero out the freeblock as we'll set this to the current block below */ + set->freeblock = NULL; + } + else if (GenerationBlockIsEmpty(set->keeper) && + GenerationBlockFreeBytes(set->keeper) >= required_size) + { + block = set->keeper; + } + else + { + /* + * The first such block has size initBlockSize, and we double the + * space in each succeeding block, but not more than maxBlockSize. + */ + blksize = set->nextBlockSize; + set->nextBlockSize <<= 1; + if (set->nextBlockSize > set->maxBlockSize) + set->nextBlockSize = set->maxBlockSize; - context->mem_allocated += blksize; + /* we'll need a block hdr too, so add that to the required size */ + required_size += Generation_BLOCKHDRSZ; - block->blksize = blksize; - block->nchunks = 0; - block->nfree = 0; + /* round the size up to the next power of 2 */ + if (blksize < required_size) + blksize = pg_nextpower2_size_t(required_size); - block->freeptr = ((char *) block) + Generation_BLOCKHDRSZ; - block->endptr = ((char *) block) + blksize; + block = (GenerationBlock *) malloc(blksize); - /* Mark unallocated space NOACCESS. */ - VALGRIND_MAKE_MEM_NOACCESS(block->freeptr, - blksize - Generation_BLOCKHDRSZ); + if (block == NULL) + return NULL; - /* add it to the doubly-linked list of blocks */ - dlist_push_head(&set->blocks, &block->node); + context->mem_allocated += blksize; + + /* initialize the new block */ + GenerationBlockInit(block, blksize); + + /* add it to the doubly-linked list of blocks */ + dlist_push_head(&set->blocks, &block->node); + + /* Zero out the freeblock in case it's become full */ + set->freeblock = NULL; + } /* and also use it as the current allocation block */ set->block = block; @@ -453,6 +565,95 @@ GenerationAlloc(MemoryContext context, Size size) return GenerationChunkGetPointer(chunk); } +/* + * GenerationBlockInit + * Initializes 'block' assuming 'blksize'. Does not update the context's + * mem_allocated field. + */ +static inline void +GenerationBlockInit(GenerationBlock *block, Size blksize) +{ + block->blksize = blksize; + block->nchunks = 0; + block->nfree = 0; + + block->freeptr = ((char *) block) + Generation_BLOCKHDRSZ; + block->endptr = ((char *) block) + blksize; + + /* Mark unallocated space NOACCESS. */ + VALGRIND_MAKE_MEM_NOACCESS(block->freeptr, + blksize - Generation_BLOCKHDRSZ); +} + +/* + * GenerationBlockIsEmpty + * Returns true iif 'block' contains no chunks + */ +static inline bool +GenerationBlockIsEmpty(GenerationBlock *block) +{ + return (block->nchunks == 0); +} + +/* + * GenerationBlockMarkEmpty + * Set a block as empty. Does not free the block. 
+ */ +static inline void +GenerationBlockMarkEmpty(GenerationBlock *block) +{ +#if defined(USE_VALGRIND) || defined(CLOBBER_FREED_MEMORY) + char *datastart = ((char *) block) + Generation_BLOCKHDRSZ; +#endif + +#ifdef CLOBBER_FREED_MEMORY + wipe_mem(datastart, block->freeptr - datastart); +#else + /* wipe_mem() would have done this */ + VALGRIND_MAKE_MEM_NOACCESS(datastart, block->freeptr - datastart); +#endif + + /* Reset the block, but don't return it to malloc */ + block->nchunks = 0; + block->nfree = 0; + block->freeptr = ((char *) block) + Generation_BLOCKHDRSZ; + +} + +/* + * GenerationBlockFreeBytes + * Returns the number of bytes free in 'block' + */ +static inline Size +GenerationBlockFreeBytes(GenerationBlock *block) +{ + return (block->endptr - block->freeptr); +} + +/* + * GenerationBlockFree + * Remove 'block' from 'context' and release the memory consumed by it. + */ +static inline void +GenerationBlockFree(GenerationContext *set, GenerationBlock *block) +{ + /* Make sure nobody tries to free the keeper block */ + Assert(block != set->keeper); + /* We shouldn't be freeing the freeblock either */ + Assert(block != set->freeblock); + + /* release the block from the list of blocks */ + dlist_delete(&block->node); + + ((MemoryContext) set)->mem_allocated -= block->blksize; + +#ifdef CLOBBER_FREED_MEMORY + wipe_mem(block, block->blksize); +#endif + + free(block); +} + /* * GenerationFree * Update number of chunks in the block, and if all chunks in the block @@ -499,16 +700,36 @@ GenerationFree(MemoryContext context, void *pointer) if (block->nfree < block->nchunks) return; + /* Don't try to free the keeper block, just mark it empty */ + if (block == set->keeper) + { + GenerationBlockMarkEmpty(block); + return; + } + /* - * The block is empty, so let's get rid of it. First remove it from the - * list of blocks, then return it to malloc(). + * If there is no freeblock set or if this is the freeblock then instead + * of freeing this memory, we keep it around so that new allocations have + * the option of recycling it. */ - dlist_delete(&block->node); + if (set->freeblock == NULL || set->freeblock == block) + { + /* XXX should we only recycle maxBlockSize sized blocks? */ + set->freeblock = block; + GenerationBlockMarkEmpty(block); + return; + } /* Also make sure the block is not marked as the current block. */ if (set->block == block) set->block = NULL; + /* + * The block is empty, so let's get rid of it. First remove it from the + * list of blocks, then return it to malloc(). + */ + dlist_delete(&block->node); + context->mem_allocated -= block->blksize; free(block); } @@ -655,8 +876,17 @@ static bool GenerationIsEmpty(MemoryContext context) { GenerationContext *set = (GenerationContext *) context; + dlist_iter iter; - return dlist_is_empty(&set->blocks); + dlist_foreach(iter, &set->blocks) + { + GenerationBlock *block = dlist_container(GenerationBlock, node, iter.cur); + + if (block->nchunks > 0) + return false; + } + + return true; } /* @@ -746,12 +976,11 @@ GenerationCheck(MemoryContext context) char *ptr; total_allocated += block->blksize; - /* - * nfree > nchunks is surely wrong, and we don't expect to see - * equality either, because such a block should have gotten freed. + * nfree > nchunks is surely wrong. Equality is allowed as the block + * might completely empty if it's the freeblock. 
*/ - if (block->nfree >= block->nchunks) + if (block->nfree > block->nchunks) elog(WARNING, "problem in Generation %s: number of free chunks %d in block %p exceeds %d allocated", name, block->nfree, block, block->nchunks); diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h index 019700247a1..495d1af2010 100644 --- a/src/include/utils/memutils.h +++ b/src/include/utils/memutils.h @@ -183,7 +183,9 @@ extern MemoryContext SlabContextCreate(MemoryContext parent, /* generation.c */ extern MemoryContext GenerationContextCreate(MemoryContext parent, const char *name, - Size blockSize); + Size minContextSize, + Size initBlockSize, + Size maxBlockSize); /* * Recommended default alloc parameters, suitable for "ordinary" contexts -- 2.17.1
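
To make the 0001 API change easier to read without applying the patch: GenerationContextCreate() now takes the same three size parameters as AllocSetContextCreate(). Below is a minimal sketch of a caller; the function name and the sizes are invented for illustration and are not taken from the patch.

#include "postgres.h"
#include "utils/memutils.h"

/*
 * Sketch only: a generation context whose blocks start small and grow.
 * Per 0001, blocks start at initBlockSize and double each time a new
 * block is malloc'd, capped at maxBlockSize.  The first (keeper) block
 * is carved out of the same malloc() as the context header and is never
 * returned to malloc(), only marked empty on reset.
 */
MemoryContext
make_example_tuple_context(MemoryContext parent)
{
	return GenerationContextCreate(parent,
								   "example tuples",	/* must be statically allocated */
								   0,					/* minContextSize */
								   8 * 1024,			/* initBlockSize */
								   1024 * 1024);		/* maxBlockSize */
}

The oversized-chunk threshold (allocChunkLimit) is also derived from maxBlockSize now, in the same way aset.c does it, rather than from a single fixed block size.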
>From 899ea66ad28b78aabd8596c35c1bff309709cd1c Mon Sep 17 00:00:00 2001 From: David Rowley <dgrow...@gmail.com> Date: Fri, 1 Apr 2022 20:04:58 +1300 Subject: [PATCH 2/4] Adjust tuplesort API to have bitwise flags instead of bool For now we only had a single bool option for randomAccess, however an upcoming patch requires adding another option, so instead of breaking the API for that, then breaking it again one day if we add more options, let's just break it once. Any options we add in the future will just make use of an unused bit in the flags. Author: David Rowley Discussion: https://postgr.es/m/CAApHDvoH4ASzsAOyHcxkuY01Qf%2B%2B8JJ0paw%2B03dk%2BW25tQEcNQ%40mail.gmail.com --- src/backend/access/gist/gistbuild.c | 2 +- src/backend/access/hash/hashsort.c | 2 +- src/backend/access/heap/heapam_handler.c | 2 +- src/backend/access/nbtree/nbtsort.c | 6 +- src/backend/catalog/index.c | 2 +- src/backend/executor/nodeAgg.c | 6 +- src/backend/executor/nodeIncrementalSort.c | 4 +- src/backend/executor/nodeSort.c | 8 +- src/backend/utils/adt/orderedsetaggs.c | 10 +- src/backend/utils/sort/tuplesort.c | 101 +++++++++++---------- src/include/utils/tuplesort.h | 19 ++-- 11 files changed, 90 insertions(+), 72 deletions(-) diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c index e081e6571a4..f5a5caff8ec 100644 --- a/src/backend/access/gist/gistbuild.c +++ b/src/backend/access/gist/gistbuild.c @@ -271,7 +271,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo) index, maintenance_work_mem, NULL, - false); + TUPLESORT_NONE); /* Scan the table, adding all tuples to the tuplesort */ reltuples = table_index_build_scan(heap, index, indexInfo, true, true, diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c index 6d8512283a8..aa61e39f26a 100644 --- a/src/backend/access/hash/hashsort.c +++ b/src/backend/access/hash/hashsort.c @@ -86,7 +86,7 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets) hspool->max_buckets, maintenance_work_mem, NULL, - false); + TUPLESORT_NONE); return hspool; } diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index dee264e8596..3a9532cb4f7 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -726,7 +726,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap, if (use_sort) tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex, maintenance_work_mem, - NULL, false); + NULL, TUPLESORT_NONE); else tuplesort = NULL; diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c index 8a19de2f66c..15605dac6bb 100644 --- a/src/backend/access/nbtree/nbtsort.c +++ b/src/backend/access/nbtree/nbtsort.c @@ -436,7 +436,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate, tuplesort_begin_index_btree(heap, index, buildstate->isunique, buildstate->nulls_not_distinct, maintenance_work_mem, coordinate, - false); + TUPLESORT_NONE); /* * If building a unique index, put dead tuples in a second spool to keep @@ -475,7 +475,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate, */ buildstate->spool2->sortstate = tuplesort_begin_index_btree(heap, index, false, false, work_mem, - coordinate2, false); + coordinate2, TUPLESORT_NONE); } /* Fill spool using either serial or parallel heap scan */ @@ -1939,7 +1939,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2, btspool->isunique, btspool->nulls_not_distinct, 
sortmem, coordinate, - false); + TUPLESORT_NONE); /* * Just as with serial case, there may be a second spool. If so, a diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index dd715ca0609..55800c94786 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -3364,7 +3364,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot) state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator, InvalidOid, false, maintenance_work_mem, - NULL, false); + NULL, TUPLESORT_NONE); state.htups = state.itups = state.tups_inserted = 0; /* ambulkdelete updates progress metrics */ diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c index 08cf569d8fa..23030a32a59 100644 --- a/src/backend/executor/nodeAgg.c +++ b/src/backend/executor/nodeAgg.c @@ -530,7 +530,7 @@ initialize_phase(AggState *aggstate, int newphase) sortnode->collations, sortnode->nullsFirst, work_mem, - NULL, false); + NULL, TUPLESORT_NONE); } aggstate->current_phase = newphase; @@ -607,7 +607,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans, pertrans->sortOperators[0], pertrans->sortCollations[0], pertrans->sortNullsFirst[0], - work_mem, NULL, false); + work_mem, NULL, TUPLESORT_NONE); } else pertrans->sortstates[aggstate->current_set] = @@ -617,7 +617,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans, pertrans->sortOperators, pertrans->sortCollations, pertrans->sortNullsFirst, - work_mem, NULL, false); + work_mem, NULL, TUPLESORT_NONE); } /* diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c index d6fb56dec73..4f50bc845da 100644 --- a/src/backend/executor/nodeIncrementalSort.c +++ b/src/backend/executor/nodeIncrementalSort.c @@ -315,7 +315,7 @@ switchToPresortedPrefixMode(PlanState *pstate) &(plannode->sort.nullsFirst[nPresortedCols]), work_mem, NULL, - false); + TUPLESORT_NONE); node->prefixsort_state = prefixsort_state; } else @@ -616,7 +616,7 @@ ExecIncrementalSort(PlanState *pstate) plannode->sort.nullsFirst, work_mem, NULL, - false); + TUPLESORT_NONE); node->fullsort_state = fullsort_state; } else diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c index 9481a622bf5..a113d737955 100644 --- a/src/backend/executor/nodeSort.c +++ b/src/backend/executor/nodeSort.c @@ -77,6 +77,7 @@ ExecSort(PlanState *pstate) Sort *plannode = (Sort *) node->ss.ps.plan; PlanState *outerNode; TupleDesc tupDesc; + int tuplesortopts = TUPLESORT_NONE; SO1_printf("ExecSort: %s\n", "sorting subplan"); @@ -96,6 +97,9 @@ ExecSort(PlanState *pstate) outerNode = outerPlanState(node); tupDesc = ExecGetResultType(outerNode); + if (node->randomAccess) + tuplesortopts |= TUPLESORT_RANDOMACCESS; + if (node->datumSort) tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid, plannode->sortOperators[0], @@ -103,7 +107,7 @@ ExecSort(PlanState *pstate) plannode->nullsFirst[0], work_mem, NULL, - node->randomAccess); + tuplesortopts); else tuplesortstate = tuplesort_begin_heap(tupDesc, plannode->numCols, @@ -113,7 +117,7 @@ ExecSort(PlanState *pstate) plannode->nullsFirst, work_mem, NULL, - node->randomAccess); + tuplesortopts); if (node->bounded) tuplesort_set_bound(tuplesortstate, node->bound); node->tuplesortstate = (void *) tuplesortstate; diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c index 96dae6ec4a8..6d4f6b7dca2 100644 --- a/src/backend/utils/adt/orderedsetaggs.c +++ b/src/backend/utils/adt/orderedsetaggs.c @@ 
-118,6 +118,7 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples) OSAPerQueryState *qstate; MemoryContext gcontext; MemoryContext oldcontext; + int tuplesortopt; /* * Check we're called as aggregate (and not a window function), and get @@ -283,6 +284,11 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples) osastate->qstate = qstate; osastate->gcontext = gcontext; + tuplesortopt = TUPLESORT_NONE; + + if (qstate->rescan_needed) + tuplesortopt |= TUPLESORT_RANDOMACCESS; + /* * Initialize tuplesort object. */ @@ -295,7 +301,7 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples) qstate->sortNullsFirsts, work_mem, NULL, - qstate->rescan_needed); + tuplesortopt); else osastate->sortstate = tuplesort_begin_datum(qstate->sortColType, qstate->sortOperator, @@ -303,7 +309,7 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples) qstate->sortNullsFirst, work_mem, NULL, - qstate->rescan_needed); + tuplesortopt); osastate->number_of_rows = 0; osastate->sort_done = false; diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 086e948fca6..5cc1328febf 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -246,7 +246,7 @@ struct Tuplesortstate { TupSortStatus status; /* enumerated value as shown above */ int nKeys; /* number of columns in sort key */ - bool randomAccess; /* did caller request random access? */ + int sortopt; /* Bitmask of flags used to setup sort */ bool bounded; /* did caller specify a maximum number of * tuples to return? */ bool boundUsed; /* true if we made use of a bounded heap */ @@ -558,12 +558,12 @@ struct Sharedsort * may or may not match the in-memory representation of the tuple --- * any conversion needed is the job of the writetup and readtup routines. * - * If state->randomAccess is true, then the stored representation of the - * tuple must be followed by another "unsigned int" that is a copy of the - * length --- so the total tape space used is actually sizeof(unsigned int) - * more than the stored length value. This allows read-backwards. When - * randomAccess is not true, the write/read routines may omit the extra - * length word. + * If state->sortopt contains TUPLESORT_RANDOMACCESS, then the stored + * representation of the tuple must be followed by another "unsigned int" that + * is a copy of the length --- so the total tape space used is actually + * sizeof(unsigned int) more than the stored length value. This allows + * read-backwards. When the random access flag was not specified, the + * write/read routines may omit the extra length word. * * writetup is expected to write both length words as well as the tuple * data. When readtup is called, the tape is positioned just after the @@ -608,7 +608,7 @@ struct Sharedsort static Tuplesortstate *tuplesort_begin_common(int workMem, SortCoordinate coordinate, - bool randomAccess); + int sortopt); static void tuplesort_begin_batch(Tuplesortstate *state); static void puttuple_common(Tuplesortstate *state, SortTuple *tuple); static bool consider_abort_common(Tuplesortstate *state); @@ -713,21 +713,20 @@ static void tuplesort_updatemax(Tuplesortstate *state); * Each variant of tuplesort_begin has a workMem parameter specifying the * maximum number of kilobytes of RAM to use before spilling data to disk. * (The normal value of this parameter is work_mem, but some callers use - * other values.) 
Each variant also has a randomAccess parameter specifying - * whether the caller needs non-sequential access to the sort result. + * other values.) Each variant also has a sortopt which is a bitmask of + * sort options. See TUPLESORT_* definitions in tuplesort.h */ static Tuplesortstate * -tuplesort_begin_common(int workMem, SortCoordinate coordinate, - bool randomAccess) +tuplesort_begin_common(int workMem, SortCoordinate coordinate, int sortopt) { Tuplesortstate *state; MemoryContext maincontext; MemoryContext sortcontext; MemoryContext oldcontext; - /* See leader_takeover_tapes() remarks on randomAccess support */ - if (coordinate && randomAccess) + /* See leader_takeover_tapes() remarks on random access support */ + if (coordinate && (sortopt & TUPLESORT_RANDOMACCESS)) elog(ERROR, "random access disallowed under parallel sort"); /* @@ -764,7 +763,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate, pg_rusage_init(&state->ru_start); #endif - state->randomAccess = randomAccess; + state->sortopt = sortopt; state->tuples = true; /* @@ -898,10 +897,10 @@ tuplesort_begin_heap(TupleDesc tupDesc, int nkeys, AttrNumber *attNums, Oid *sortOperators, Oid *sortCollations, bool *nullsFirstFlags, - int workMem, SortCoordinate coordinate, bool randomAccess) + int workMem, SortCoordinate coordinate, int sortopt) { Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate, - randomAccess); + sortopt); MemoryContext oldcontext; int i; @@ -913,7 +912,7 @@ tuplesort_begin_heap(TupleDesc tupDesc, if (trace_sort) elog(LOG, "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c", - nkeys, workMem, randomAccess ? 't' : 'f'); + nkeys, workMem, sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f'); #endif state->nKeys = nkeys; @@ -922,7 +921,7 @@ tuplesort_begin_heap(TupleDesc tupDesc, false, /* no unique check */ nkeys, workMem, - randomAccess, + sortopt & TUPLESORT_RANDOMACCESS, PARALLEL_SORT(state)); state->comparetup = comparetup_heap; @@ -971,10 +970,10 @@ Tuplesortstate * tuplesort_begin_cluster(TupleDesc tupDesc, Relation indexRel, int workMem, - SortCoordinate coordinate, bool randomAccess) + SortCoordinate coordinate, int sortopt) { Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate, - randomAccess); + sortopt); BTScanInsert indexScanKey; MemoryContext oldcontext; int i; @@ -988,7 +987,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc, elog(LOG, "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c", RelationGetNumberOfAttributes(indexRel), - workMem, randomAccess ? 't' : 'f'); + workMem, sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f'); #endif state->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel); @@ -997,7 +996,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc, false, /* no unique check */ state->nKeys, workMem, - randomAccess, + sortopt & TUPLESORT_RANDOMACCESS, PARALLEL_SORT(state)); state->comparetup = comparetup_cluster; @@ -1069,10 +1068,10 @@ tuplesort_begin_index_btree(Relation heapRel, bool uniqueNullsNotDistinct, int workMem, SortCoordinate coordinate, - bool randomAccess) + int sortopt) { Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate, - randomAccess); + sortopt); BTScanInsert indexScanKey; MemoryContext oldcontext; int i; @@ -1084,7 +1083,7 @@ tuplesort_begin_index_btree(Relation heapRel, elog(LOG, "begin index sort: unique = %c, workMem = %d, randomAccess = %c", enforceUnique ? 't' : 'f', - workMem, randomAccess ? 't' : 'f'); + workMem, sortopt & TUPLESORT_RANDOMACCESS ? 
't' : 'f'); #endif state->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel); @@ -1093,7 +1092,7 @@ tuplesort_begin_index_btree(Relation heapRel, enforceUnique, state->nKeys, workMem, - randomAccess, + sortopt & TUPLESORT_RANDOMACCESS, PARALLEL_SORT(state)); state->comparetup = comparetup_index_btree; @@ -1150,10 +1149,10 @@ tuplesort_begin_index_hash(Relation heapRel, uint32 max_buckets, int workMem, SortCoordinate coordinate, - bool randomAccess) + int sortopt) { Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate, - randomAccess); + sortopt); MemoryContext oldcontext; oldcontext = MemoryContextSwitchTo(state->maincontext); @@ -1166,7 +1165,8 @@ tuplesort_begin_index_hash(Relation heapRel, high_mask, low_mask, max_buckets, - workMem, randomAccess ? 't' : 'f'); + workMem, + sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f'); #endif state->nKeys = 1; /* Only one sort column, the hash code */ @@ -1193,10 +1193,10 @@ tuplesort_begin_index_gist(Relation heapRel, Relation indexRel, int workMem, SortCoordinate coordinate, - bool randomAccess) + int sortopt) { Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate, - randomAccess); + sortopt); MemoryContext oldcontext; int i; @@ -1206,7 +1206,7 @@ tuplesort_begin_index_gist(Relation heapRel, if (trace_sort) elog(LOG, "begin index sort: workMem = %d, randomAccess = %c", - workMem, randomAccess ? 't' : 'f'); + workMem, sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f'); #endif state->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel); @@ -1248,10 +1248,10 @@ tuplesort_begin_index_gist(Relation heapRel, Tuplesortstate * tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation, bool nullsFirstFlag, int workMem, - SortCoordinate coordinate, bool randomAccess) + SortCoordinate coordinate, int sortopt) { Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate, - randomAccess); + sortopt); MemoryContext oldcontext; int16 typlen; bool typbyval; @@ -1262,7 +1262,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation, if (trace_sort) elog(LOG, "begin datum sort: workMem = %d, randomAccess = %c", - workMem, randomAccess ? 't' : 'f'); + workMem, sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f'); #endif state->nKeys = 1; /* always a one-column sort */ @@ -2165,7 +2165,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward, switch (state->status) { case TSS_SORTEDINMEM: - Assert(forward || state->randomAccess); + Assert(forward || state->sortopt & TUPLESORT_RANDOMACCESS); Assert(!state->slabAllocatorUsed); if (forward) { @@ -2209,7 +2209,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward, break; case TSS_SORTEDONTAPE: - Assert(forward || state->randomAccess); + Assert(forward || state->sortopt & TUPLESORT_RANDOMACCESS); Assert(state->slabAllocatorUsed); /* @@ -2984,7 +2984,8 @@ mergeruns(Tuplesortstate *state) * sorted tape, we can stop at this point and do the final merge * on-the-fly. 
*/ - if (!state->randomAccess && state->nInputRuns <= state->nInputTapes + if ((state->sortopt & TUPLESORT_RANDOMACCESS) == 0 + && state->nInputRuns <= state->nInputTapes && !WORKER(state)) { /* Tell logtape.c we won't be writing anymore */ @@ -3230,7 +3231,7 @@ tuplesort_rescan(Tuplesortstate *state) { MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext); - Assert(state->randomAccess); + Assert(state->sortopt & TUPLESORT_RANDOMACCESS); switch (state->status) { @@ -3263,7 +3264,7 @@ tuplesort_markpos(Tuplesortstate *state) { MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext); - Assert(state->randomAccess); + Assert(state->sortopt & TUPLESORT_RANDOMACCESS); switch (state->status) { @@ -3294,7 +3295,7 @@ tuplesort_restorepos(Tuplesortstate *state) { MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext); - Assert(state->randomAccess); + Assert(state->sortopt & TUPLESORT_RANDOMACCESS); switch (state->status) { @@ -3853,7 +3854,7 @@ writetup_heap(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup) LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen)); LogicalTapeWrite(tape, (void *) tupbody, tupbodylen); - if (state->randomAccess) /* need trailing length word? */ + if (state->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */ LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen)); if (!state->slabAllocatorUsed) @@ -3876,7 +3877,7 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup, /* read in the tuple proper */ tuple->t_len = tuplen; LogicalTapeReadExact(tape, tupbody, tupbodylen); - if (state->randomAccess) /* need trailing length word? */ + if (state->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */ LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen)); stup->tuple = (void *) tuple; /* set up first-column key value */ @@ -4087,7 +4088,7 @@ writetup_cluster(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup) LogicalTapeWrite(tape, &tuplen, sizeof(tuplen)); LogicalTapeWrite(tape, &tuple->t_self, sizeof(ItemPointerData)); LogicalTapeWrite(tape, tuple->t_data, tuple->t_len); - if (state->randomAccess) /* need trailing length word? */ + if (state->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */ LogicalTapeWrite(tape, &tuplen, sizeof(tuplen)); if (!state->slabAllocatorUsed) @@ -4113,7 +4114,7 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup, tuple->t_tableOid = InvalidOid; /* Read in the tuple body */ LogicalTapeReadExact(tape, tuple->t_data, tuple->t_len); - if (state->randomAccess) /* need trailing length word? */ + if (state->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */ LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen)); stup->tuple = (void *) tuple; /* set up first-column key value, if it's a simple column */ @@ -4337,7 +4338,7 @@ writetup_index(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup) tuplen = IndexTupleSize(tuple) + sizeof(tuplen); LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen)); LogicalTapeWrite(tape, (void *) tuple, IndexTupleSize(tuple)); - if (state->randomAccess) /* need trailing length word? */ + if (state->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */ LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen)); if (!state->slabAllocatorUsed) @@ -4355,7 +4356,7 @@ readtup_index(Tuplesortstate *state, SortTuple *stup, IndexTuple tuple = (IndexTuple) readtup_alloc(state, tuplen); LogicalTapeReadExact(tape, tuple, tuplen); - if (state->randomAccess) /* need trailing length word? 
*/ + if (state->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */ LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen)); stup->tuple = (void *) tuple; /* set up first-column key value */ @@ -4425,7 +4426,7 @@ writetup_datum(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup) LogicalTapeWrite(tape, (void *) &writtenlen, sizeof(writtenlen)); LogicalTapeWrite(tape, waddr, tuplen); - if (state->randomAccess) /* need trailing length word? */ + if (state->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */ LogicalTapeWrite(tape, (void *) &writtenlen, sizeof(writtenlen)); if (!state->slabAllocatorUsed && stup->tuple) @@ -4465,7 +4466,7 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup, stup->tuple = raddr; } - if (state->randomAccess) /* need trailing length word? */ + if (state->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */ LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen)); } diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h index da5ba591989..345f4ce8024 100644 --- a/src/include/utils/tuplesort.h +++ b/src/include/utils/tuplesort.h @@ -86,6 +86,12 @@ typedef enum SORT_SPACE_TYPE_MEMORY } TuplesortSpaceType; +/* Bitwise option flags for tuple sorts */ +#define TUPLESORT_NONE 0 + +/* specifies whether non-sequential access to the sort result is required */ +#define TUPLESORT_RANDOMACCESS (1 << 0) + typedef struct TuplesortInstrumentation { TuplesortMethod sortMethod; /* sort algorithm used */ @@ -201,32 +207,33 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc, Oid *sortOperators, Oid *sortCollations, bool *nullsFirstFlags, int workMem, SortCoordinate coordinate, - bool randomAccess); + int sortopt); extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc, Relation indexRel, int workMem, - SortCoordinate coordinate, bool randomAccess); + SortCoordinate coordinate, + int sortopt); extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel, Relation indexRel, bool enforceUnique, bool uniqueNullsNotDistinct, int workMem, SortCoordinate coordinate, - bool randomAccess); + int sortopt); extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel, Relation indexRel, uint32 high_mask, uint32 low_mask, uint32 max_buckets, int workMem, SortCoordinate coordinate, - bool randomAccess); + int sortopt); extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel, Relation indexRel, int workMem, SortCoordinate coordinate, - bool randomAccess); + int sortopt); extern Tuplesortstate *tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation, bool nullsFirstFlag, int workMem, SortCoordinate coordinate, - bool randomAccess); + int sortopt); extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound); extern bool tuplesort_used_bound(Tuplesortstate *state); -- 2.17.1
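
To make the 0002 call-site change concrete: callers that used to pass the randomAccess bool now build a bitmask of TUPLESORT_* options. A sketch under the patch's API follows; need_random_access and the other parameters are stand-ins for whatever a real caller has (compare the nodeSort.c hunk above):

#include "postgres.h"
#include "utils/tuplesort.h"

static Tuplesortstate *
begin_example_sort(TupleDesc tupDesc, int nkeys, AttrNumber *attNums,
				   Oid *sortOperators, Oid *sortCollations,
				   bool *nullsFirstFlags, int workMem,
				   bool need_random_access)
{
	int			sortopt = TUPLESORT_NONE;

	/* previously this bool was passed straight to tuplesort_begin_heap() */
	if (need_random_access)
		sortopt |= TUPLESORT_RANDOMACCESS;

	return tuplesort_begin_heap(tupDesc, nkeys, attNums, sortOperators,
								sortCollations, nullsFirstFlags,
								workMem, NULL, sortopt);
}

Internally, every test of state->randomAccess becomes a test of state->sortopt & TUPLESORT_RANDOMACCESS, so a future option only consumes another bit instead of another API break.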
>From b07d6406fccc50ff59d4f48cbf81a55f7eed8ab6 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryz...@telsasoft.com>
Date: Fri, 1 Apr 2022 07:00:57 -0500
Subject: [PATCH 3/4] f!

---
 src/backend/utils/sort/tuplesort.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 5cc1328febf..a0ab61e8f9a 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1271,7 +1271,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 								false,	/* no unique check */
 								1,
 								workMem,
-								randomAccess,
+								sortopt & TUPLESORT_RANDOMACCESS,
 								PARALLEL_SORT(state));
 
 	state->comparetup = comparetup_datum;
-- 
2.17.1
>From 5a4904b1087795a6bb76ccb7d0ecf349dc10492f Mon Sep 17 00:00:00 2001 From: David Rowley <dgrow...@gmail.com> Date: Fri, 1 Apr 2022 21:38:03 +1300 Subject: [PATCH 4/4] Use Generation memory contexts to store tuples in sorts The general usage pattern when we store tuples in tuplesort.c is that we store a series of tuples one by one then either perform a sort or spill them to disk. In the common case, there is no pfreeing of already stored tuples. For the common case since we do not individually pfree tuples, we have very little need for aset.c memory allocation behavior which maintains freelists and always rounds allocation sizes up to the next power of 2 size. Here we conditionally use generation.c contexts for storing tuples in tuplesort.c when the sort will never be bounded. Unfortunately, the memory context to store tuples is already created by the time any calls would be made to tuplesort_set_bound(), so here we add a new sort option that allows callers to specify if they're going to need a bounded sort or not. We'll use a standard aset.c allocator when this sort option is not set. Author: David Rowley Discussion: https://postgr.es/m/caaphdvoh4aszsaoyhcxkuy01qf++8jj0paw+03dk+w25tqe...@mail.gmail.com --- src/backend/executor/nodeIncrementalSort.c | 6 ++++-- src/backend/executor/nodeSort.c | 2 ++ src/backend/utils/sort/tuplesort.c | 20 ++++++++++++++++---- src/include/utils/tuplesort.h | 3 +++ 4 files changed, 25 insertions(+), 6 deletions(-) diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c index 4f50bc845da..3c5e8c9a969 100644 --- a/src/backend/executor/nodeIncrementalSort.c +++ b/src/backend/executor/nodeIncrementalSort.c @@ -315,7 +315,7 @@ switchToPresortedPrefixMode(PlanState *pstate) &(plannode->sort.nullsFirst[nPresortedCols]), work_mem, NULL, - TUPLESORT_NONE); + node->bounded ? TUPLESORT_ALLOWBOUNDED : TUPLESORT_NONE); node->prefixsort_state = prefixsort_state; } else @@ -616,7 +616,9 @@ ExecIncrementalSort(PlanState *pstate) plannode->sort.nullsFirst, work_mem, NULL, - TUPLESORT_NONE); + node->bounded ? + TUPLESORT_ALLOWBOUNDED : + TUPLESORT_NONE); node->fullsort_state = fullsort_state; } else diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c index a113d737955..3c28d60c3ef 100644 --- a/src/backend/executor/nodeSort.c +++ b/src/backend/executor/nodeSort.c @@ -99,6 +99,8 @@ ExecSort(PlanState *pstate) if (node->randomAccess) tuplesortopts |= TUPLESORT_RANDOMACCESS; + if (node->bounded) + tuplesortopts |= TUPLESORT_ALLOWBOUNDED; if (node->datumSort) tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid, diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index a0ab61e8f9a..613f55ca21b 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -842,11 +842,21 @@ tuplesort_begin_batch(Tuplesortstate *state) * eases memory management. Resetting at key points reduces * fragmentation. Note that the memtuples array of SortTuples is allocated * in the parent context, not this context, because there is no need to - * free memtuples early. + * free memtuples early. For bounded sorts, tuples may be pfreed in any + * order, so we use a regular aset.c context so that it can make use of + * free'd memory. When the sort is not bounded, we make use of a + * generation.c context as this keeps allocations more compact with less + * wastage. Allocations are also slightly more CPU efficient. 
*/ - state->tuplecontext = AllocSetContextCreate(state->sortcontext, - "Caller tuples", - ALLOCSET_DEFAULT_SIZES); + if (state->sortopt & TUPLESORT_ALLOWBOUNDED) + state->tuplecontext = AllocSetContextCreate(state->sortcontext, + "Caller tuples", + ALLOCSET_DEFAULT_SIZES); + else + state->tuplecontext = GenerationContextCreate(state->sortcontext, + "Caller tuples", + ALLOCSET_DEFAULT_SIZES); + state->status = TSS_INITIAL; state->bounded = false; @@ -1337,6 +1347,8 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound) { /* Assert we're called before loading any tuples */ Assert(state->status == TSS_INITIAL && state->memtupcount == 0); + /* Assert we allow bounded sorts */ + Assert(state->sortopt & TUPLESORT_ALLOWBOUNDED); /* Can't set the bound twice, either */ Assert(!state->bounded); /* Also, this shouldn't be called in a parallel worker */ diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h index 345f4ce8024..364cf132fcb 100644 --- a/src/include/utils/tuplesort.h +++ b/src/include/utils/tuplesort.h @@ -92,6 +92,9 @@ typedef enum /* specifies whether non-sequential access to the sort result is required */ #define TUPLESORT_RANDOMACCESS (1 << 0) +/* specifies if the tuplesort is able to support bounded sorts */ +#define TUPLESORT_ALLOWBOUNDED (1 << 1) + typedef struct TuplesortInstrumentation { TuplesortMethod sortMethod; /* sort algorithm used */ -- 2.17.1
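
Putting 0002 and 0004 together, here is a sketch of the intended usage for a caller that might have a LIMIT pushed down. It uses the datum-sort entry point purely for brevity; "bounded" and "bound" stand in for the caller's own state, as in nodeSort.c:

#include "postgres.h"
#include "utils/tuplesort.h"

static Tuplesortstate *
begin_example_datum_sort(Oid datumType, Oid sortOperator, Oid sortCollation,
						 bool nullsFirstFlag, int workMem,
						 bool bounded, int64 bound)
{
	Tuplesortstate *state;
	int			sortopt = TUPLESORT_NONE;

	/*
	 * Only callers that may later call tuplesort_set_bound() set this flag.
	 * Bounded sorts pfree evicted tuples in arbitrary order, so they keep
	 * the aset.c tuple context; unbounded sorts get a generation.c context.
	 */
	if (bounded)
		sortopt |= TUPLESORT_ALLOWBOUNDED;

	state = tuplesort_begin_datum(datumType, sortOperator, sortCollation,
								  nullsFirstFlag, workMem, NULL, sortopt);

	if (bounded)
		tuplesort_set_bound(state, bound);	/* Asserts the flag was passed */

	return state;
}

The flag is needed up front because the tuple context is created in tuplesort_begin_batch(), before tuplesort_set_bound() can possibly be called, so the caller has to announce in advance whether a bound may be applied.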