On Wed, 23 Mar 2022 at 04:08, David Rowley <dgrowle...@gmail.com> wrote:
> If nobody has any objections to the 0001 patch then I'd like to move
> ahead with it in the next few days.  For the 0002 patch, I'm currently
> feeling like we shouldn't be using the Generation context for bounded
> sorts. The only way I can think to do that is with an API change to
> tuplesort. I feel 0001 is useful even without 0002.

0001:
I've made some small revisions to the 0001 patch so that the keeper
block and freeblock are only reused when they're entirely empty.  The
situation I want to avoid is this: when the current block does not have
enough free space for some new allocation, we go on to try the
freeblock and then the keeper block.  The problem is that we may
previously have partially filled the keeper block or freeblock and been
unable to fit some medium-sized allocation in it, which is what caused
us to move to a new block in the first place.  If we didn't check that
the keeper block and freeblock were empty first, we could end up
storing some small allocation in a block where a previous medium-sized
allocation couldn't fit.  That's not necessarily a problem in itself,
but it does mean that consecutive allocations might not end up in the
same block or the next block.  Over time, in a FIFO-type workload, that
could cause fragmentation, which could leave us unable to free blocks.
For longer-lived contexts I imagine that could end up fairly bad.  The
updated patch should avoid that problem.
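
To make the ordering concrete, here's a condensed sketch of the block
selection order in GenerationAlloc as revised in v5.  (This is a
simplified illustration only; choose_block doesn't exist in the patch,
but GenerationBlockIsEmpty/GenerationBlockFreeBytes do, and the real
code does the new-block allocation and bookkeeping inline.)

/* Simplified sketch of how GenerationAlloc picks a block in v5. */
static GenerationBlock *
choose_block(GenerationContext *set, Size required_size)
{
	/* 1. Keep filling the current block while it has room. */
	if (set->block != NULL &&
		GenerationBlockFreeBytes(set->block) >= required_size)
		return set->block;

	/* 2. Reuse the recycled block, but only once it's completely empty. */
	if (set->freeblock != NULL &&
		GenerationBlockIsEmpty(set->freeblock) &&
		GenerationBlockFreeBytes(set->freeblock) >= required_size)
		return set->freeblock;

	/* 3. Likewise the keeper block, which is never returned to malloc. */
	if (GenerationBlockIsEmpty(set->keeper) &&
		GenerationBlockFreeBytes(set->keeper) >= required_size)
		return set->keeper;

	/*
	 * 4. Otherwise the caller mallocs a new block, doubling the size up
	 * to maxBlockSize.
	 */
	return NULL;
}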

0002:
This modifies the tuplesort API so that the randomAccess bool
parameter becomes a bitwise flags parameter, to which we can add
further options in the future.  It's slightly annoying to break the
API, but it's not exactly going to be hard for people to code around.
It also means we may not have to break the API again if some future
change only needs a new flag bit.
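
To show the caller-side change, here's roughly what the nodeSort.c hunk
in the 0002 patch below does: build the flags value up front and pass it
where the bool used to go (condensed from the diff; the argument list is
abbreviated to the relevant parts):

	int		tuplesortopts = TUPLESORT_NONE;

	if (node->randomAccess)
		tuplesortopts |= TUPLESORT_RANDOMACCESS;

	tuplesortstate = tuplesort_begin_heap(tupDesc,
										  plannode->numCols,
										  plannode->sortColIdx,
										  plannode->sortOperators,
										  plannode->collations,
										  plannode->nullsFirst,
										  work_mem,
										  NULL,
										  tuplesortopts); /* was: node->randomAccess */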

0003:
This adds a new TUPLESORT_ALLOWBOUNDED flag and modifies the places
that set a sort bound to pass it.  The patch only uses the generation
context when the flag is not passed.
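
Since the 0003 patch itself isn't attached to this email, here's only a
rough sketch of the caller-side change (hypothetical code, following the
nodeSort.c example above): a call site that intends to apply a bound
also sets the new flag before calling tuplesort_begin_heap(), and then
sets the bound as before:

	if (node->bounded)
		tuplesortopts |= TUPLESORT_ALLOWBOUNDED;

	/* ... tuplesort_begin_heap(..., tuplesortopts) as above ... */

	if (node->bounded)
		tuplesort_set_bound(tuplesortstate, node->bound);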

I feel these are pretty simple patches, and if nobody has any objections
then I plan to push all 3 of them on my New Zealand Monday morning.

David
From 053066f8276809af1b3f2c26e1e86b8e1db85bac Mon Sep 17 00:00:00 2001
From: David Rowley <dgrow...@gmail.com>
Date: Wed, 16 Feb 2022 10:23:32 +1300
Subject: [PATCH v5 1/3] Improve the generation memory allocator

Here we make a series of improvements to the generation memory
allocator:

1. Allow generation contexts to have minimum, initial and maximum block
sizes. The standard allocator already allows this, but when the generation
context was added it only allowed fixed-size blocks.  The problem with
fixed-size blocks is that it's difficult to choose how large to make them.
If the chosen size is too small then we end up with a large number of
blocks and a large number of malloc calls.  If the block size is made too
large, then memory is wasted.

2. Add support for "keeper" blocks.  This is a special block that is
allocated along with the context itself but is never freed.  Instead,
when the last chunk in the keeper block is freed, we simply mark the block
as empty to allow new allocations to make use of it.

3. Add a facility to "recycle" newly empty blocks instead of freeing them
and having to later malloc an entire new block again.  We do this by
recording a single GenerationBlock which has become empty of any chunks.
When we run out of space in the current block, we check to see if there is
a "freeblock" and use that if it contains enough space for the allocation.

Author: David Rowley, Tomas Vondra
Discussion: 
https://postgr.es/m/d987fd54-01f8-0f73-af6c-519f799a0...@enterprisedb.com
---
 src/backend/access/gist/gistvacuum.c          |   6 +
 .../replication/logical/reorderbuffer.c       |   7 +
 src/backend/utils/mmgr/generation.c           | 383 ++++++++++++++----
 src/include/utils/memutils.h                  |   4 +-
 4 files changed, 322 insertions(+), 78 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index aac4afab8f..f190decdff 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -158,9 +158,15 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
         * pages in page_set_context.  Internally, the integer set will remember
         * this context so that the subsequent allocations for these integer sets
         * will be done from the same context.
+        *
+        * XXX the allocation sizes used below pre-date generation context's block
+        * growing code.  These values should likely be benchmarked and set to
+        * more suitable values.
         */
        vstate.page_set_context = GenerationContextCreate(CurrentMemoryContext,
                                                           "GiST VACUUM page set context",
+                                                          16 * 1024,
+                                                          16 * 1024,
                                                           16 * 1024);
        oldctx = MemoryContextSwitchTo(vstate.page_set_context);
        vstate.internal_page_set = intset_create();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c2d9be81fa..4702750a2e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -370,8 +370,15 @@ ReorderBufferAllocate(void)
                                                   SLAB_DEFAULT_BLOCK_SIZE,
                                                   sizeof(ReorderBufferTXN));
 
+       /*
+        * XXX the allocation sizes used below pre-date generation context's block
+        * growing code.  These values should likely be benchmarked and set to
+        * more suitable values.
+        */
        buffer->tup_context = GenerationContextCreate(new_ctx,
                                                      "Tuples",
+                                                     SLAB_LARGE_BLOCK_SIZE,
+                                                     SLAB_LARGE_BLOCK_SIZE,
                                                      SLAB_LARGE_BLOCK_SIZE);
 
        hash_ctl.keysize = sizeof(TransactionId);
diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c
index 95c006f689..2906520155 100644
--- a/src/backend/utils/mmgr/generation.c
+++ b/src/backend/utils/mmgr/generation.c
@@ -20,18 +20,15 @@
  *
  *     The memory context uses a very simple approach to free space management.
  *     Instead of a complex global freelist, each block tracks a number
- *     of allocated and freed chunks. Freed chunks are not reused, and once all
- *     chunks in a block are freed, the whole block is thrown away. When the
- *     chunks allocated in the same block have similar lifespan, this works
- *     very well and is very cheap.
+ *     of allocated and freed chunks.  The block is classed as empty when the
+ *     number of free chunks is equal to the number of allocated chunks.  When
+ *     this occurs, instead of freeing the block, we try to "recycle" it, i.e.
+ *     reuse it for new allocations.  This is done by setting the block in the
+ *     context's 'freeblock' field.  If the freeblock field is already occupied
+ *     by another free block we simply return the newly empty block to malloc.
  *
- *     The current implementation only uses a fixed block size - maybe it should
- *     adapt a min/max block size range, and grow the blocks automatically.
- *     It already uses dedicated blocks for oversized chunks.
- *
- *     XXX It might be possible to improve this by keeping a small freelist for
- *     only a small number of recent blocks, but it's not clear it's worth the
- *     additional complexity.
+ *     This approach to free blocks requires fewer malloc/free calls for truly
+ *     first-allocated, first-freed allocation patterns.
  *
  *-------------------------------------------------------------------------
  */
@@ -39,6 +36,7 @@
 #include "postgres.h"
 
 #include "lib/ilist.h"
+#include "port/pg_bitutils.h"
 #include "utils/memdebug.h"
 #include "utils/memutils.h"
 
@@ -46,6 +44,8 @@
 #define Generation_BLOCKHDRSZ  MAXALIGN(sizeof(GenerationBlock))
 #define Generation_CHUNKHDRSZ  sizeof(GenerationChunk)
 
+#define Generation_CHUNK_FRACTION      8
+
 typedef struct GenerationBlock GenerationBlock; /* forward reference */
 typedef struct GenerationChunk GenerationChunk;
 
@@ -60,20 +60,28 @@ typedef struct GenerationContext
        MemoryContextData header;       /* Standard memory-context fields */
 
        /* Generational context parameters */
-       Size            blockSize;              /* standard block size */
-
-       GenerationBlock *block;         /* current (most recently allocated) block */
+       Size            initBlockSize;  /* initial block size */
+       Size            maxBlockSize;   /* maximum block size */
+       Size            nextBlockSize;  /* next block size to allocate */
+       Size            allocChunkLimit;        /* effective chunk size limit */
+
+       GenerationBlock *block;         /* current (most recently allocated) block, or
+                                        * NULL if we've just freed the most recent
+                                        * block */
+       GenerationBlock *freeblock;     /* pointer to a block that's being recycled,
+                                        * or NULL if there's no such block. */
+       GenerationBlock *keeper;        /* keep this block over resets */
        dlist_head      blocks;                 /* list of blocks */
 } GenerationContext;
 
 /*
  * GenerationBlock
  *             GenerationBlock is the unit of memory that is obtained by generation.c
- *             from malloc().  It contains one or more GenerationChunks, which are
+ *             from malloc().  It contains zero or more GenerationChunks, which are
  *             the units requested by palloc() and freed by pfree().  GenerationChunks
  *             cannot be returned to malloc() individually, instead pfree()
  *             updates the free counter of the block and when all chunks in a block
- *             are free the whole block is returned to malloc().
+ *             are free the whole block can be returned to malloc().
  *
  *             GenerationBlock is the header data for a block --- the usable space
  *             within the block begins at the next alignment boundary.
@@ -143,6 +151,14 @@ struct GenerationChunk
 #define GenerationChunkGetPointer(chk) \
        ((GenerationPointer *)(((char *)(chk)) + Generation_CHUNKHDRSZ))
 
+/* Inlined helper functions */
+static inline void GenerationBlockInit(GenerationBlock *block, Size blksize);
+static inline bool GenerationBlockIsEmpty(GenerationBlock *block);
+static inline void GenerationBlockMarkEmpty(GenerationBlock *block);
+static inline Size GenerationBlockFreeBytes(GenerationBlock *block);
+static inline void GenerationBlockFree(GenerationContext *set,
+                                       GenerationBlock *block);
+
 /*
  * These functions implement the MemoryContext API for Generation contexts.
  */
@@ -191,14 +207,21 @@ static const MemoryContextMethods GenerationMethods = {
  *
  * parent: parent context, or NULL if top-level context
  * name: name of context (must be statically allocated)
- * blockSize: generation block size
+ * minContextSize: minimum context size
+ * initBlockSize: initial allocation block size
+ * maxBlockSize: maximum allocation block size
  */
 MemoryContext
 GenerationContextCreate(MemoryContext parent,
                                                const char *name,
-                                               Size blockSize)
+                                               Size minContextSize,
+                                               Size initBlockSize,
+                                               Size maxBlockSize)
 {
+       Size            firstBlockSize;
+       Size            allocSize;
        GenerationContext *set;
+       GenerationBlock    *block;
 
        /* Assert we padded GenerationChunk properly */
        StaticAssertStmt(Generation_CHUNKHDRSZ == MAXALIGN(Generation_CHUNKHDRSZ),
@@ -208,24 +231,33 @@ GenerationContextCreate(MemoryContext parent,
                                         "padding calculation in 
GenerationChunk is wrong");
 
        /*
-        * First, validate allocation parameters.  (If we're going to throw an
-        * error, we should do so before the context is created, not after.)  We
-        * somewhat arbitrarily enforce a minimum 1K block size, mostly because
-        * that's what AllocSet does.
+        * First, validate allocation parameters.  Asserts seem sufficient because
+        * nobody varies their parameters at runtime.  We somewhat arbitrarily
+        * enforce a minimum 1K block size.
         */
-       if (blockSize != MAXALIGN(blockSize) ||
-               blockSize < 1024 ||
-               !AllocHugeSizeIsValid(blockSize))
-               elog(ERROR, "invalid blockSize for memory context: %zu",
-                        blockSize);
+       Assert(initBlockSize == MAXALIGN(initBlockSize) &&
+                  initBlockSize >= 1024);
+       Assert(maxBlockSize == MAXALIGN(maxBlockSize) &&
+                  maxBlockSize >= initBlockSize &&
+                  AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
+       Assert(minContextSize == 0 ||
+                  (minContextSize == MAXALIGN(minContextSize) &&
+                       minContextSize >= 1024 &&
+                       minContextSize <= maxBlockSize));
+
+       /* Determine size of initial block */
+       allocSize = MAXALIGN(sizeof(GenerationContext)) +
+               Generation_BLOCKHDRSZ + Generation_CHUNKHDRSZ;
+       if (minContextSize != 0)
+               allocSize = Max(allocSize, minContextSize);
+       else
+               allocSize = Max(allocSize, initBlockSize);
 
        /*
-        * Allocate the context header.  Unlike aset.c, we never try to combine
-        * this with the first regular block, since that would prevent us from
-        * freeing the first generation of allocations.
+        * Allocate the initial block.  Unlike other generation.c blocks, it
+        * starts with the context header and its block header follows that.
         */
-
-       set = (GenerationContext *) malloc(MAXALIGN(sizeof(GenerationContext)));
+       set = (GenerationContext *) malloc(allocSize);
        if (set == NULL)
        {
                MemoryContextStats(TopMemoryContext);
@@ -240,11 +272,40 @@ GenerationContextCreate(MemoryContext parent,
         * Avoid writing code that can fail between here and MemoryContextCreate;
         * we'd leak the header if we ereport in this stretch.
         */
+       dlist_init(&set->blocks);
+
+       /* Fill in the initial block's block header */
+       block = (GenerationBlock *) (((char *) set) + MAXALIGN(sizeof(GenerationContext)));
+       /* determine the block size and initialize it */
+       firstBlockSize = allocSize - MAXALIGN(sizeof(GenerationContext));
+       GenerationBlockInit(block, firstBlockSize);
+
+       /* add it to the doubly-linked list of blocks */
+       dlist_push_head(&set->blocks, &block->node);
+
+       /* use it as the current allocation block */
+       set->block = block;
+
+       /* No free block, yet */
+       set->freeblock = NULL;
+
+       /* Mark block as not to be released at reset time */
+       set->keeper = block;
 
        /* Fill in GenerationContext-specific header fields */
-       set->blockSize = blockSize;
-       set->block = NULL;
-       dlist_init(&set->blocks);
+       set->initBlockSize = initBlockSize;
+       set->maxBlockSize = maxBlockSize;
+       set->nextBlockSize = initBlockSize;
+
+       /*
+        * Compute the allocation chunk size limit for this context.
+        *
+        * Follows similar ideas as AllocSet, see aset.c for details ...
+        */
+       set->allocChunkLimit = maxBlockSize;
+       while ((Size) (set->allocChunkLimit + Generation_CHUNKHDRSZ) >
+                  (Size) ((Size) (maxBlockSize - Generation_BLOCKHDRSZ) / Generation_CHUNK_FRACTION))
+               set->allocChunkLimit >>= 1;
 
        /* Finally, do the type-independent part of context creation */
        MemoryContextCreate((MemoryContext) set,
@@ -253,6 +314,8 @@ GenerationContextCreate(MemoryContext parent,
                                                parent,
                                                name);
 
+       ((MemoryContext) set)->mem_allocated = firstBlockSize;
+
        return (MemoryContext) set;
 }
 
@@ -276,24 +339,32 @@ GenerationReset(MemoryContext context)
        GenerationCheck(context);
 #endif
 
+       /*
+        * NULLify the free block pointer.  We must do this before calling
+        * GenerationBlockFree as that function never expects to free the
+        * freeblock.
+        */
+       set->freeblock = NULL;
+
        dlist_foreach_modify(miter, &set->blocks)
        {
                GenerationBlock *block = dlist_container(GenerationBlock, node, miter.cur);
 
-               dlist_delete(miter.cur);
-
-               context->mem_allocated -= block->blksize;
-
-#ifdef CLOBBER_FREED_MEMORY
-               wipe_mem(block, block->blksize);
-#endif
-
-               free(block);
+               if (block == set->keeper)
+                       GenerationBlockMarkEmpty(block);
+               else
+                       GenerationBlockFree(set, block);
        }
 
-       set->block = NULL;
+       /* set it so new allocations make use of the keeper block */
+       set->block = set->keeper;
 
-       Assert(dlist_is_empty(&set->blocks));
+       /* Reset block size allocation sequence, too */
+       set->nextBlockSize = set->initBlockSize;
+
+       /* Ensure there is only 1 item in the dlist */
+       Assert(!dlist_is_empty(&set->blocks));
+       Assert(!dlist_has_next(&set->blocks, dlist_head_node(&set->blocks)));
 }
 
 /*
@@ -303,9 +374,9 @@ GenerationReset(MemoryContext context)
 static void
 GenerationDelete(MemoryContext context)
 {
-       /* Reset to release all the GenerationBlocks */
+       /* Reset to release all releasable GenerationBlocks */
        GenerationReset(context);
-       /* And free the context header */
+       /* And free the context header and keeper block */
        free(context);
 }
 
@@ -329,11 +400,12 @@ GenerationAlloc(MemoryContext context, Size size)
        GenerationBlock *block;
        GenerationChunk *chunk;
        Size            chunk_size = MAXALIGN(size);
+       Size            required_size = chunk_size + Generation_CHUNKHDRSZ;
 
        /* is it an over-sized chunk? if yes, allocate special block */
-       if (chunk_size > set->blockSize / 8)
+       if (chunk_size > set->allocChunkLimit)
        {
-               Size            blksize = chunk_size + Generation_BLOCKHDRSZ + Generation_CHUNKHDRSZ;
+               Size            blksize = required_size + Generation_BLOCKHDRSZ;
 
                block = (GenerationBlock *) malloc(blksize);
                if (block == NULL)
@@ -379,36 +451,76 @@ GenerationAlloc(MemoryContext context, Size size)
        }
 
        /*
-        * Not an over-sized chunk. Is there enough space in the current block? If
-        * not, allocate a new "regular" block.
+        * Not an oversized chunk.  We try to first make use of the current
+        * block, but if there's not enough space in it, instead of allocating a
+        * new block, we look to see if the freeblock is empty and has enough
+        * space.  If not, we'll also try the same using the keeper block.  The
+        * keeper block may have become empty and we have no other way to reuse it
+        * again if we don't try to use it explicitly here.
+        *
+        * We don't want to start filling the freeblock before the current block
+        * is full, otherwise we may cause fragmentation in FIFO type workloads.
+        * We only switch to using the freeblock or keeper block if those blocks
+        * are completely empty.  If we didn't do that we could end up fragmenting
+        * consecutive allocations over multiple blocks which would be a problem
+        * that would compound over time.
         */
        block = set->block;
 
-       if ((block == NULL) ||
-               (block->endptr - block->freeptr) < Generation_CHUNKHDRSZ + chunk_size)
+       if (block == NULL ||
+               GenerationBlockFreeBytes(block) < required_size)
        {
-               Size            blksize = set->blockSize;
+               Size            blksize;
+               GenerationBlock *freeblock = set->freeblock;
 
-               block = (GenerationBlock *) malloc(blksize);
+               if (freeblock != NULL &&
+                       GenerationBlockIsEmpty(freeblock) &&
+                       GenerationBlockFreeBytes(freeblock) >= required_size)
+               {
+                       block = freeblock;
 
-               if (block == NULL)
-                       return NULL;
+                       /* Zero out the freeblock as we'll set this to the current block below */
+                       set->freeblock = NULL;
+               }
+               else if (GenerationBlockIsEmpty(set->keeper) &&
+                                GenerationBlockFreeBytes(set->keeper) >= required_size)
+               {
+                       block = set->keeper;
+               }
+               else
+               {
+                       /*
+                        * The first such block has size initBlockSize, and we double the
+                        * space in each succeeding block, but not more than maxBlockSize.
+                        */
+                       blksize = set->nextBlockSize;
+                       set->nextBlockSize <<= 1;
+                       if (set->nextBlockSize > set->maxBlockSize)
+                               set->nextBlockSize = set->maxBlockSize;
 
-               context->mem_allocated += blksize;
+                       /* we'll need a block hdr too, so add that to the required size */
+                       required_size += Generation_BLOCKHDRSZ;
 
-               block->blksize = blksize;
-               block->nchunks = 0;
-               block->nfree = 0;
+                       /* round the size up to the next power of 2 */
+                       if (blksize < required_size)
+                               blksize = pg_nextpower2_size_t(required_size);
 
-               block->freeptr = ((char *) block) + Generation_BLOCKHDRSZ;
-               block->endptr = ((char *) block) + blksize;
+                       block = (GenerationBlock *) malloc(blksize);
 
-               /* Mark unallocated space NOACCESS. */
-               VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
-                                                                  blksize - Generation_BLOCKHDRSZ);
+                       if (block == NULL)
+                               return NULL;
 
-               /* add it to the doubly-linked list of blocks */
-               dlist_push_head(&set->blocks, &block->node);
+                       context->mem_allocated += blksize;
+
+                       /* initialize the new block */
+                       GenerationBlockInit(block, blksize);
+
+                       /* add it to the doubly-linked list of blocks */
+                       dlist_push_head(&set->blocks, &block->node);
+
+                       /* Zero out the freeblock in case it's become full */
+                       set->freeblock = NULL;
+               }
 
                /* and also use it as the current allocation block */
                set->block = block;
@@ -453,6 +565,95 @@ GenerationAlloc(MemoryContext context, Size size)
        return GenerationChunkGetPointer(chunk);
 }
 
+/*
+ * GenerationBlockInit
+ *             Initializes 'block' assuming 'blksize'.  Does not update the context's
+ *             mem_allocated field.
+ */
+static inline void
+GenerationBlockInit(GenerationBlock *block, Size blksize)
+{
+       block->blksize = blksize;
+       block->nchunks = 0;
+       block->nfree = 0;
+
+       block->freeptr = ((char *) block) + Generation_BLOCKHDRSZ;
+       block->endptr = ((char *) block) + blksize;
+
+       /* Mark unallocated space NOACCESS. */
+       VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
+                                                          blksize - Generation_BLOCKHDRSZ);
+}
+
+/*
+ * GenerationBlockIsEmpty
+ *             Returns true iff 'block' contains no chunks
+ */
+static inline bool
+GenerationBlockIsEmpty(GenerationBlock *block)
+{
+       return (block->nchunks == 0);
+}
+
+/*
+ * GenerationBlockMarkEmpty
+ *             Set a block as empty.  Does not free the block.
+ */
+static inline void
+GenerationBlockMarkEmpty(GenerationBlock *block)
+{
+#if defined(USE_VALGRIND) || defined(CLOBBER_FREED_MEMORY)
+       char       *datastart = ((char *) block) + Generation_BLOCKHDRSZ;
+#endif
+
+#ifdef CLOBBER_FREED_MEMORY
+       wipe_mem(datastart, block->freeptr - datastart);
+#else
+       /* wipe_mem() would have done this */
+       VALGRIND_MAKE_MEM_NOACCESS(datastart, block->freeptr - datastart);
+#endif
+
+       /* Reset the block, but don't return it to malloc */
+       block->nchunks = 0;
+       block->nfree = 0;
+       block->freeptr = ((char *) block) + Generation_BLOCKHDRSZ;
+
+}
+
+/*
+ * GenerationBlockFreeBytes
+ *             Returns the number of bytes free in 'block'
+ */
+static inline Size
+GenerationBlockFreeBytes(GenerationBlock *block)
+{
+       return (block->endptr - block->freeptr);
+}
+
+/*
+ * GenerationBlockFree
+ *             Remove 'block' from 'context' and release the memory consumed by it.
+ */
+static inline void
+GenerationBlockFree(GenerationContext *set, GenerationBlock *block)
+{
+       /* Make sure nobody tries to free the keeper block */
+       Assert(block != set->keeper);
+       /* We shouldn't be freeing the freeblock either */
+       Assert(block != set->freeblock);
+
+       /* release the block from the list of blocks */
+       dlist_delete(&block->node);
+
+       ((MemoryContext) set)->mem_allocated -= block->blksize;
+
+#ifdef CLOBBER_FREED_MEMORY
+       wipe_mem(block, block->blksize);
+#endif
+
+       free(block);
+}
+
 /*
  * GenerationFree
  *             Update number of chunks in the block, and if all chunks in the block
@@ -499,16 +700,36 @@ GenerationFree(MemoryContext context, void *pointer)
        if (block->nfree < block->nchunks)
                return;
 
+       /* Don't try to free the keeper block, just mark it empty */
+       if (block == set->keeper)
+       {
+               GenerationBlockMarkEmpty(block);
+               return;
+       }
+
        /*
-        * The block is empty, so let's get rid of it. First remove it from the
-        * list of blocks, then return it to malloc().
+        * If there is no freeblock set or if this is the freeblock then instead
+        * of freeing this memory, we keep it around so that new allocations have
+        * the option of recycling it.
         */
-       dlist_delete(&block->node);
+       if (set->freeblock == NULL || set->freeblock == block)
+       {
+               /* XXX should we only recycle maxBlockSize sized blocks? */
+               set->freeblock = block;
+               GenerationBlockMarkEmpty(block);
+               return;
+       }
 
        /* Also make sure the block is not marked as the current block. */
        if (set->block == block)
                set->block = NULL;
 
+       /*
+        * The block is empty, so let's get rid of it. First remove it from the
+        * list of blocks, then return it to malloc().
+        */
+       dlist_delete(&block->node);
+
        context->mem_allocated -= block->blksize;
        free(block);
 }
@@ -655,8 +876,17 @@ static bool
 GenerationIsEmpty(MemoryContext context)
 {
        GenerationContext *set = (GenerationContext *) context;
+       dlist_iter      iter;
 
-       return dlist_is_empty(&set->blocks);
+       dlist_foreach(iter, &set->blocks)
+       {
+               GenerationBlock *block = dlist_container(GenerationBlock, node, iter.cur);
+
+               if (block->nchunks > 0)
+                       return false;
+       }
+
+       return true;
 }
 
 /*
@@ -746,12 +976,11 @@ GenerationCheck(MemoryContext context)
                char       *ptr;
 
                total_allocated += block->blksize;
-
                /*
-                * nfree > nchunks is surely wrong, and we don't expect to see
-                * equality either, because such a block should have gotten freed.
+                * nfree > nchunks is surely wrong.  Equality is allowed as the block
+                * might be completely empty if it's the freeblock.
                 */
-               if (block->nfree >= block->nchunks)
+               if (block->nfree > block->nchunks)
                        elog(WARNING, "problem in Generation %s: number of free 
chunks %d in block %p exceeds %d allocated",
                                 name, block->nfree, block, block->nchunks);
 
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index 019700247a..495d1af201 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -183,7 +183,9 @@ extern MemoryContext SlabContextCreate(MemoryContext parent,
 /* generation.c */
 extern MemoryContext GenerationContextCreate(MemoryContext parent,
                                              const char *name,
-                                             Size blockSize);
+                                             Size minContextSize,
+                                             Size initBlockSize,
+                                             Size maxBlockSize);
 
 /*
  * Recommended default alloc parameters, suitable for "ordinary" contexts
-- 
2.32.0

From 8df0112a3554925fa77271e45cf520501fa71b91 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrow...@gmail.com>
Date: Fri, 1 Apr 2022 20:04:58 +1300
Subject: [PATCH v5 2/3] Adjust tuplesort API to have bitwise flags instead of
 bool

Previously tuplesort's only option was a single randomAccess bool, but an
upcoming patch requires adding another option.  Instead of breaking the
API for that, and then breaking it again one day if we add more options,
let's just break it once.  Any options we add in the future will just make
use of an unused bit in the flags.

Author: David Rowley
Discussion: 
https://postgr.es/m/CAApHDvoH4ASzsAOyHcxkuY01Qf%2B%2B8JJ0paw%2B03dk%2BW25tQEcNQ%40mail.gmail.com
---
 src/backend/access/gist/gistbuild.c        |   2 +-
 src/backend/access/hash/hashsort.c         |   2 +-
 src/backend/access/heap/heapam_handler.c   |   2 +-
 src/backend/access/nbtree/nbtsort.c        |   6 +-
 src/backend/catalog/index.c                |   2 +-
 src/backend/executor/nodeAgg.c             |   6 +-
 src/backend/executor/nodeIncrementalSort.c |   4 +-
 src/backend/executor/nodeSort.c            |   8 +-
 src/backend/utils/adt/orderedsetaggs.c     |  10 +-
 src/backend/utils/sort/tuplesort.c         | 101 +++++++++++----------
 src/include/utils/tuplesort.h              |  19 ++--
 11 files changed, 90 insertions(+), 72 deletions(-)

diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index e081e6571a..f5a5caff8e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -271,7 +271,7 @@ gistbuild(Relation heap, Relation index, IndexInfo 
*indexInfo)
                                                                                
                                  index,
                                                                                
                                  maintenance_work_mem,
                                                                                
                                  NULL,
-                                                                               
                                  false);
+                                                                               
                                  TUPLESORT_NONE);
 
                /* Scan the table, adding all tuples to the tuplesort */
                reltuples = table_index_build_scan(heap, index, indexInfo, 
true, true,
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 6d8512283a..aa61e39f26 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -86,7 +86,7 @@ _h_spoolinit(Relation heap, Relation index, uint32 
num_buckets)
                                                                                
                   hspool->max_buckets,
                                                                                
                   maintenance_work_mem,
                                                                                
                   NULL,
-                                                                               
                   false);
+                                                                               
                   TUPLESORT_NONE);
 
        return hspool;
 }
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dee264e859..3a9532cb4f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -726,7 +726,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation 
NewHeap,
        if (use_sort)
                tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
                                                                                
        maintenance_work_mem,
-                                                                               
        NULL, false);
+                                                                               
        NULL, TUPLESORT_NONE);
        else
                tuplesort = NULL;
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c074513efa..9f60fa9894 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -436,7 +436,7 @@ _bt_spools_heapscan(Relation heap, Relation index, 
BTBuildState *buildstate,
                tuplesort_begin_index_btree(heap, index, buildstate->isunique,
                                                                        
buildstate->nulls_not_distinct,
                                                                        
maintenance_work_mem, coordinate,
-                                                                       false);
+                                                                       
TUPLESORT_NONE);
 
        /*
         * If building a unique index, put dead tuples in a second spool to keep
@@ -475,7 +475,7 @@ _bt_spools_heapscan(Relation heap, Relation index, 
BTBuildState *buildstate,
                 */
                buildstate->spool2->sortstate =
                        tuplesort_begin_index_btree(heap, index, false, false, 
work_mem,
-                                                                               
coordinate2, false);
+                                                                               
coordinate2, TUPLESORT_NONE);
        }
 
        /* Fill spool using either serial or parallel heap scan */
@@ -1939,7 +1939,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool 
*btspool2,
                                                                                
                         btspool->isunique,
                                                                                
                         btspool->nulls_not_distinct,
                                                                                
                         sortmem, coordinate,
-                                                                               
                         false);
+                                                                               
                         TUPLESORT_NONE);
 
        /*
         * Just as with serial case, there may be a second spool.  If so, a
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index dd715ca060..55800c9478 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3364,7 +3364,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
        state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
                                                                                
        InvalidOid, false,
                                                                                
        maintenance_work_mem,
-                                                                               
        NULL, false);
+                                                                               
        NULL, TUPLESORT_NONE);
        state.htups = state.itups = state.tups_inserted = 0;
 
        /* ambulkdelete updates progress metrics */
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 08cf569d8f..23030a32a5 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -530,7 +530,7 @@ initialize_phase(AggState *aggstate, int newphase)
                                                                                
                  sortnode->collations,
                                                                                
                  sortnode->nullsFirst,
                                                                                
                  work_mem,
-                                                                               
                  NULL, false);
+                                                                               
                  NULL, TUPLESORT_NONE);
        }
 
        aggstate->current_phase = newphase;
@@ -607,7 +607,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans 
pertrans,
                                                                          
pertrans->sortOperators[0],
                                                                          
pertrans->sortCollations[0],
                                                                          
pertrans->sortNullsFirst[0],
-                                                                         
work_mem, NULL, false);
+                                                                         
work_mem, NULL, TUPLESORT_NONE);
                }
                else
                        pertrans->sortstates[aggstate->current_set] =
@@ -617,7 +617,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans 
pertrans,
                                                                         
pertrans->sortOperators,
                                                                         
pertrans->sortCollations,
                                                                         
pertrans->sortNullsFirst,
-                                                                        
work_mem, NULL, false);
+                                                                        
work_mem, NULL, TUPLESORT_NONE);
        }
 
        /*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index d6fb56dec7..4f50bc845d 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -315,7 +315,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
                                                                                
                &(plannode->sort.nullsFirst[nPresortedCols]),
                                                                                
                work_mem,
                                                                                
                NULL,
-                                                                               
                false);
+                                                                               
                TUPLESORT_NONE);
                node->prefixsort_state = prefixsort_state;
        }
        else
@@ -616,7 +616,7 @@ ExecIncrementalSort(PlanState *pstate)
                                                                                
                  plannode->sort.nullsFirst,
                                                                                
                  work_mem,
                                                                                
                  NULL,
-                                                                               
                  false);
+                                                                               
                  TUPLESORT_NONE);
                        node->fullsort_state = fullsort_state;
                }
                else
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 9481a622bf..a113d73795 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -77,6 +77,7 @@ ExecSort(PlanState *pstate)
                Sort       *plannode = (Sort *) node->ss.ps.plan;
                PlanState  *outerNode;
                TupleDesc       tupDesc;
+               int                     tuplesortopts = TUPLESORT_NONE;
 
                SO1_printf("ExecSort: %s\n",
                                   "sorting subplan");
@@ -96,6 +97,9 @@ ExecSort(PlanState *pstate)
                outerNode = outerPlanState(node);
                tupDesc = ExecGetResultType(outerNode);
 
+               if (node->randomAccess)
+                       tuplesortopts |= TUPLESORT_RANDOMACCESS;
+
                if (node->datumSort)
                        tuplesortstate = 
tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
                                                                                
                   plannode->sortOperators[0],
@@ -103,7 +107,7 @@ ExecSort(PlanState *pstate)
                                                                                
                   plannode->nullsFirst[0],
                                                                                
                   work_mem,
                                                                                
                   NULL,
-                                                                               
                   node->randomAccess);
+                                                                               
                   tuplesortopts);
                else
                        tuplesortstate = tuplesort_begin_heap(tupDesc,
                                                                                
                  plannode->numCols,
@@ -113,7 +117,7 @@ ExecSort(PlanState *pstate)
                                                                                
                  plannode->nullsFirst,
                                                                                
                  work_mem,
                                                                                
                  NULL,
-                                                                               
                  node->randomAccess);
+                                                                               
                  tuplesortopts);
                if (node->bounded)
                        tuplesort_set_bound(tuplesortstate, node->bound);
                node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 96dae6ec4a..6d4f6b7dca 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -118,6 +118,7 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool 
use_tuples)
        OSAPerQueryState *qstate;
        MemoryContext gcontext;
        MemoryContext oldcontext;
+       int                     tuplesortopt;
 
        /*
         * Check we're called as aggregate (and not a window function), and get
@@ -283,6 +284,11 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool 
use_tuples)
        osastate->qstate = qstate;
        osastate->gcontext = gcontext;
 
+       tuplesortopt = TUPLESORT_NONE;
+
+       if (qstate->rescan_needed)
+               tuplesortopt |= TUPLESORT_RANDOMACCESS;
+
        /*
         * Initialize tuplesort object.
         */
@@ -295,7 +301,7 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool 
use_tuples)
                                                                                
                   qstate->sortNullsFirsts,
                                                                                
                   work_mem,
                                                                                
                   NULL,
-                                                                               
                   qstate->rescan_needed);
+                                                                               
                   tuplesortopt);
        else
                osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
                                                                                
                        qstate->sortOperator,
@@ -303,7 +309,7 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool 
use_tuples)
                                                                                
                        qstate->sortNullsFirst,
                                                                                
                        work_mem,
                                                                                
                        NULL,
-                                                                               
                        qstate->rescan_needed);
+                                                                               
                        tuplesortopt);
 
        osastate->number_of_rows = 0;
        osastate->sort_done = false;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 086e948fca..5cc1328feb 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -246,7 +246,7 @@ struct Tuplesortstate
 {
        TupSortStatus status;           /* enumerated value as shown above */
        int                     nKeys;                  /* number of columns in 
sort key */
-       bool            randomAccess;   /* did caller request random access? */
+       int                     sortopt;                /* Bitmask of flags used to setup sort */
        bool            bounded;                /* did caller specify a maximum number of
                                                 * tuples to return? */
        bool            boundUsed;              /* true if we made use of a bounded heap */
@@ -558,12 +558,12 @@ struct Sharedsort
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
- * If state->randomAccess is true, then the stored representation of the
- * tuple must be followed by another "unsigned int" that is a copy of the
- * length --- so the total tape space used is actually sizeof(unsigned int)
- * more than the stored length value.  This allows read-backwards.  When
- * randomAccess is not true, the write/read routines may omit the extra
- * length word.
+ * If state->sortopt contains TUPLESORT_RANDOMACCESS, then the stored
+ * representation of the tuple must be followed by another "unsigned int" that
+ * is a copy of the length --- so the total tape space used is actually
+ * sizeof(unsigned int) more than the stored length value.  This allows
+ * read-backwards.  When the random access flag was not specified, the
+ * write/read routines may omit the extra length word.
  *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
@@ -608,7 +608,7 @@ struct Sharedsort
 
 static Tuplesortstate *tuplesort_begin_common(int workMem,
                                               SortCoordinate coordinate,
-                                              bool randomAccess);
+                                              int sortopt);
 static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
@@ -713,21 +713,20 @@ static void tuplesort_updatemax(Tuplesortstate *state);
  * Each variant of tuplesort_begin has a workMem parameter specifying the
  * maximum number of kilobytes of RAM to use before spilling data to disk.
  * (The normal value of this parameter is work_mem, but some callers use
- * other values.)  Each variant also has a randomAccess parameter specifying
- * whether the caller needs non-sequential access to the sort result.
+ * other values.)  Each variant also has a sortopt argument, which is a
+ * bitmask of sort options.  See the TUPLESORT_* definitions in tuplesort.h.
  */
 
 static Tuplesortstate *
-tuplesort_begin_common(int workMem, SortCoordinate coordinate,
-                                          bool randomAccess)
+tuplesort_begin_common(int workMem, SortCoordinate coordinate, int sortopt)
 {
        Tuplesortstate *state;
        MemoryContext maincontext;
        MemoryContext sortcontext;
        MemoryContext oldcontext;
 
-       /* See leader_takeover_tapes() remarks on randomAccess support */
-       if (coordinate && randomAccess)
+       /* See leader_takeover_tapes() remarks on random access support */
+       if (coordinate && (sortopt & TUPLESORT_RANDOMACCESS))
                elog(ERROR, "random access disallowed under parallel sort");
 
        /*
@@ -764,7 +763,7 @@ tuplesort_begin_common(int workMem, SortCoordinate 
coordinate,
                pg_rusage_init(&state->ru_start);
 #endif
 
-       state->randomAccess = randomAccess;
+       state->sortopt = sortopt;
        state->tuples = true;
 
        /*
@@ -898,10 +897,10 @@ tuplesort_begin_heap(TupleDesc tupDesc,
                                         int nkeys, AttrNumber *attNums,
                                         Oid *sortOperators, Oid 
*sortCollations,
                                         bool *nullsFirstFlags,
-                                        int workMem, SortCoordinate 
coordinate, bool randomAccess)
+                                        int workMem, SortCoordinate 
coordinate, int sortopt)
 {
        Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
-                                                                               
                   randomAccess);
+                                                                               
                   sortopt);
        MemoryContext oldcontext;
        int                     i;
 
@@ -913,7 +912,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
        if (trace_sort)
                elog(LOG,
                         "begin tuple sort: nkeys = %d, workMem = %d, 
randomAccess = %c",
-                        nkeys, workMem, randomAccess ? 't' : 'f');
+                        nkeys, workMem, sortopt & TUPLESORT_RANDOMACCESS ? 't' 
: 'f');
 #endif
 
        state->nKeys = nkeys;
@@ -922,7 +921,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
                                                                false,  /* no 
unique check */
                                                                nkeys,
                                                                workMem,
-                                                               randomAccess,
+                                                               sortopt & 
TUPLESORT_RANDOMACCESS,
                                                                
PARALLEL_SORT(state));
 
        state->comparetup = comparetup_heap;
@@ -971,10 +970,10 @@ Tuplesortstate *
 tuplesort_begin_cluster(TupleDesc tupDesc,
                                                Relation indexRel,
                                                int workMem,
-                                               SortCoordinate coordinate, bool 
randomAccess)
+                                               SortCoordinate coordinate, int 
sortopt)
 {
        Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
-                                                                               
                   randomAccess);
+                                                                               
                   sortopt);
        BTScanInsert indexScanKey;
        MemoryContext oldcontext;
        int                     i;
@@ -988,7 +987,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
                elog(LOG,
                         "begin tuple sort: nkeys = %d, workMem = %d, 
randomAccess = %c",
                         RelationGetNumberOfAttributes(indexRel),
-                        workMem, randomAccess ? 't' : 'f');
+                        workMem, sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
 #endif
 
        state->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
@@ -997,7 +996,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
                                                                false,  /* no 
unique check */
                                                                state->nKeys,
                                                                workMem,
-                                                               randomAccess,
+                                                               sortopt & 
TUPLESORT_RANDOMACCESS,
                                                                
PARALLEL_SORT(state));
 
        state->comparetup = comparetup_cluster;
@@ -1069,10 +1068,10 @@ tuplesort_begin_index_btree(Relation heapRel,
                                                        bool 
uniqueNullsNotDistinct,
                                                        int workMem,
                                                        SortCoordinate 
coordinate,
-                                                       bool randomAccess)
+                                                       int sortopt)
 {
        Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
-                                                                               
                   randomAccess);
+                                                                               
                   sortopt);
        BTScanInsert indexScanKey;
        MemoryContext oldcontext;
        int                     i;
@@ -1084,7 +1083,7 @@ tuplesort_begin_index_btree(Relation heapRel,
                elog(LOG,
                         "begin index sort: unique = %c, workMem = %d, 
randomAccess = %c",
                         enforceUnique ? 't' : 'f',
-                        workMem, randomAccess ? 't' : 'f');
+                        workMem, sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
 #endif
 
        state->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
@@ -1093,7 +1092,7 @@ tuplesort_begin_index_btree(Relation heapRel,
                                                                enforceUnique,
                                                                state->nKeys,
                                                                workMem,
-                                                               randomAccess,
+                                                               sortopt & TUPLESORT_RANDOMACCESS,
                                                                PARALLEL_SORT(state));
 
        state->comparetup = comparetup_index_btree;
@@ -1150,10 +1149,10 @@ tuplesort_begin_index_hash(Relation heapRel,
                                                   uint32 max_buckets,
                                                   int workMem,
                                                   SortCoordinate coordinate,
-                                                  bool randomAccess)
+                                                  int sortopt)
 {
        Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
-                                                                                  randomAccess);
+                                                                                  sortopt);
        MemoryContext oldcontext;
 
        oldcontext = MemoryContextSwitchTo(state->maincontext);
@@ -1166,7 +1165,8 @@ tuplesort_begin_index_hash(Relation heapRel,
                         high_mask,
                         low_mask,
                         max_buckets,
-                        workMem, randomAccess ? 't' : 'f');
+                        workMem,
+                        sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
 #endif
 
        state->nKeys = 1;                       /* Only one sort column, the hash code */
@@ -1193,10 +1193,10 @@ tuplesort_begin_index_gist(Relation heapRel,
                                                   Relation indexRel,
                                                   int workMem,
                                                   SortCoordinate coordinate,
-                                                  bool randomAccess)
+                                                  int sortopt)
 {
        Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
-                                                                                  randomAccess);
+                                                                                  sortopt);
        MemoryContext oldcontext;
        int                     i;
 
@@ -1206,7 +1206,7 @@ tuplesort_begin_index_gist(Relation heapRel,
        if (trace_sort)
                elog(LOG,
                         "begin index sort: workMem = %d, randomAccess = %c",
-                        workMem, randomAccess ? 't' : 'f');
+                        workMem, sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
 #endif
 
        state->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
@@ -1248,10 +1248,10 @@ tuplesort_begin_index_gist(Relation heapRel,
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
                                          bool nullsFirstFlag, int workMem,
-                                         SortCoordinate coordinate, bool randomAccess)
+                                         SortCoordinate coordinate, int sortopt)
 {
        Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
-                                                                                  randomAccess);
+                                                                                  sortopt);
        MemoryContext oldcontext;
        int16           typlen;
        bool            typbyval;
@@ -1262,7 +1262,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
        if (trace_sort)
                elog(LOG,
                         "begin datum sort: workMem = %d, randomAccess = %c",
-                        workMem, randomAccess ? 't' : 'f');
+                        workMem, sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
 #endif
 
        state->nKeys = 1;                       /* always a one-column sort */
@@ -2165,7 +2165,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
        switch (state->status)
        {
                case TSS_SORTEDINMEM:
-                       Assert(forward || state->randomAccess);
+                       Assert(forward || state->sortopt & TUPLESORT_RANDOMACCESS);
                        Assert(!state->slabAllocatorUsed);
                        if (forward)
                        {
@@ -2209,7 +2209,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
                        break;
 
                case TSS_SORTEDONTAPE:
-                       Assert(forward || state->randomAccess);
+                       Assert(forward || state->sortopt & TUPLESORT_RANDOMACCESS);
                        Assert(state->slabAllocatorUsed);
 
                        /*
@@ -2984,7 +2984,8 @@ mergeruns(Tuplesortstate *state)
                         * sorted tape, we can stop at this point and do the final merge
                         * on-the-fly.
                         */
-                       if (!state->randomAccess && state->nInputRuns <= state->nInputTapes
+                       if ((state->sortopt & TUPLESORT_RANDOMACCESS) == 0
+                               && state->nInputRuns <= state->nInputTapes
                                && !WORKER(state))
                        {
                                /* Tell logtape.c we won't be writing anymore */
@@ -3230,7 +3231,7 @@ tuplesort_rescan(Tuplesortstate *state)
 {
        MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 
-       Assert(state->randomAccess);
+       Assert(state->sortopt & TUPLESORT_RANDOMACCESS);
 
        switch (state->status)
        {
@@ -3263,7 +3264,7 @@ tuplesort_markpos(Tuplesortstate *state)
 {
        MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 
-       Assert(state->randomAccess);
+       Assert(state->sortopt & TUPLESORT_RANDOMACCESS);
 
        switch (state->status)
        {
@@ -3294,7 +3295,7 @@ tuplesort_restorepos(Tuplesortstate *state)
 {
        MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 
-       Assert(state->randomAccess);
+       Assert(state->sortopt & TUPLESORT_RANDOMACCESS);
 
        switch (state->status)
        {
@@ -3853,7 +3854,7 @@ writetup_heap(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
 
        LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen));
        LogicalTapeWrite(tape, (void *) tupbody, tupbodylen);
-       if (state->randomAccess)        /* need trailing length word? */
+       if (state->sortopt & TUPLESORT_RANDOMACCESS)    /* need trailing length word? */
                LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen));
 
        if (!state->slabAllocatorUsed)
@@ -3876,7 +3877,7 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
        /* read in the tuple proper */
        tuple->t_len = tuplen;
        LogicalTapeReadExact(tape, tupbody, tupbodylen);
-       if (state->randomAccess)        /* need trailing length word? */
+       if (state->sortopt & TUPLESORT_RANDOMACCESS)    /* need trailing length word? */
                LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
        stup->tuple = (void *) tuple;
        /* set up first-column key value */
@@ -4087,7 +4088,7 @@ writetup_cluster(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
        LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
        LogicalTapeWrite(tape, &tuple->t_self, sizeof(ItemPointerData));
        LogicalTapeWrite(tape, tuple->t_data, tuple->t_len);
-       if (state->randomAccess)        /* need trailing length word? */
+       if (state->sortopt & TUPLESORT_RANDOMACCESS)    /* need trailing length word? */
                LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
 
        if (!state->slabAllocatorUsed)
@@ -4113,7 +4114,7 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
        tuple->t_tableOid = InvalidOid;
        /* Read in the tuple body */
        LogicalTapeReadExact(tape, tuple->t_data, tuple->t_len);
-       if (state->randomAccess)        /* need trailing length word? */
+       if (state->sortopt & TUPLESORT_RANDOMACCESS)    /* need trailing length word? */
                LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
        stup->tuple = (void *) tuple;
        /* set up first-column key value, if it's a simple column */
@@ -4337,7 +4338,7 @@ writetup_index(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
        tuplen = IndexTupleSize(tuple) + sizeof(tuplen);
        LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen));
        LogicalTapeWrite(tape, (void *) tuple, IndexTupleSize(tuple));
-       if (state->randomAccess)        /* need trailing length word? */
+       if (state->sortopt & TUPLESORT_RANDOMACCESS)    /* need trailing length word? */
                LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen));
 
        if (!state->slabAllocatorUsed)
@@ -4355,7 +4356,7 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
        IndexTuple      tuple = (IndexTuple) readtup_alloc(state, tuplen);
 
        LogicalTapeReadExact(tape, tuple, tuplen);
-       if (state->randomAccess)        /* need trailing length word? */
+       if (state->sortopt & TUPLESORT_RANDOMACCESS)    /* need trailing length word? */
                LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
        stup->tuple = (void *) tuple;
        /* set up first-column key value */
@@ -4425,7 +4426,7 @@ writetup_datum(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
 
        LogicalTapeWrite(tape, (void *) &writtenlen, sizeof(writtenlen));
        LogicalTapeWrite(tape, waddr, tuplen);
-       if (state->randomAccess)        /* need trailing length word? */
+       if (state->sortopt & TUPLESORT_RANDOMACCESS)    /* need trailing length word? */
                LogicalTapeWrite(tape, (void *) &writtenlen, sizeof(writtenlen));
 
        if (!state->slabAllocatorUsed && stup->tuple)
@@ -4465,7 +4466,7 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
                stup->tuple = raddr;
        }
 
-       if (state->randomAccess)        /* need trailing length word? */
+       if (state->sortopt & TUPLESORT_RANDOMACCESS)    /* need trailing length 
word? */
                LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
 }
 
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index da5ba59198..345f4ce802 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -86,6 +86,12 @@ typedef enum
        SORT_SPACE_TYPE_MEMORY
 } TuplesortSpaceType;
 
+/* Bitwise option flags for tuple sorts */
+#define TUPLESORT_NONE                                 0
+
+/* specifies whether non-sequential access to the sort result is required */
+#define        TUPLESORT_RANDOMACCESS                  (1 << 0)
+
 typedef struct TuplesortInstrumentation
 {
        TuplesortMethod sortMethod; /* sort algorithm used */
@@ -201,32 +207,33 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
                                                                                        Oid *sortOperators, Oid *sortCollations,
                                                                                        bool *nullsFirstFlags,
                                                                                        int workMem, SortCoordinate coordinate,
-                                                                                       bool randomAccess);
+                                                                                       int sortopt);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
                                                                                           Relation indexRel, int workMem,
-                                                                                          SortCoordinate coordinate, bool randomAccess);
+                                                                                          SortCoordinate coordinate,
+                                                                                          int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
                                                                                                   Relation indexRel,
                                                                                                   bool enforceUnique,
                                                                                                   bool uniqueNullsNotDistinct,
                                                                                                   int workMem, SortCoordinate coordinate,
-                                                                                                  bool randomAccess);
+                                                                                                  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
                                                                                                  Relation indexRel,
                                                                                                  uint32 high_mask,
                                                                                                  uint32 low_mask,
                                                                                                  uint32 max_buckets,
                                                                                                  int workMem, SortCoordinate coordinate,
-                                                                                                 bool randomAccess);
+                                                                                                 int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
                                                                                                  Relation indexRel,
                                                                                                  int workMem, SortCoordinate coordinate,
-                                                                                                 bool randomAccess);
+                                                                                                 int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
                                                                                         Oid sortOperator, Oid sortCollation,
                                                                                         bool nullsFirstFlag,
                                                                                         int workMem, SortCoordinate coordinate,
-                                                                                        bool randomAccess);
+                                                                                        int sortopt);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
 extern bool tuplesort_used_bound(Tuplesortstate *state);
-- 
2.32.0
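
For out-of-tree callers, the 0002 change boils down to replacing the trailing
randomAccess bool with a flags word.  A minimal before/after sketch (the
surrounding variables are illustrative, not taken from the patch):

    /* before: trailing bool requested random access to the sort result */
    state = tuplesort_begin_heap(tupDesc, nkeys, attNums,
                                 sortOperators, sortCollations,
                                 nullsFirstFlags, work_mem,
                                 NULL, true);

    /* after: pass TUPLESORT_RANDOMACCESS, or TUPLESORT_NONE if not needed */
    state = tuplesort_begin_heap(tupDesc, nkeys, attNums,
                                 sortOperators, sortCollations,
                                 nullsFirstFlags, work_mem,
                                 NULL, TUPLESORT_RANDOMACCESS);

Options are combined with bitwise OR, so further flags can be added later
without another signature change.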

From 3dd9284a381356c1a6b2a97977df148c388c0df1 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrow...@gmail.com>
Date: Fri, 1 Apr 2022 21:38:03 +1300
Subject: [PATCH v5 3/3] Use Generation memory contexts to store tuples in
 sorts

The general usage pattern when we store tuples in tuplesort.c is that
we store a series of tuples one by one and then either perform a sort or
spill them to disk.  In the common case, already stored tuples are never
pfreed individually, so we have very little need for aset.c's memory
allocation behavior, which maintains freelists and always rounds each
allocation size up to the next power of 2.

Here we conditionally use generation.c contexts for storing tuples in
tuplesort.c when the sort will never be bounded.  Unfortunately, the
memory context used to store tuples is already created by the time any
call to tuplesort_set_bound() could be made, so here we add a new sort
option which lets callers declare up front whether they may need a
bounded sort.  When that option is set we stick with a standard aset.c
allocator, since a bounded sort pfrees tuples in arbitrary order;
otherwise we use a generation.c context.

Author: David Rowley
Discussion: https://postgr.es/m/caaphdvoh4aszsaoyhcxkuy01qf++8jj0paw+03dk+w25tqe...@mail.gmail.com
---
 src/backend/executor/nodeIncrementalSort.c |  6 ++++--
 src/backend/executor/nodeSort.c            |  2 ++
 src/backend/utils/sort/tuplesort.c         | 20 ++++++++++++++++----
 src/include/utils/tuplesort.h              |  3 +++
 4 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 4f50bc845d..3c5e8c9a96 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -315,7 +315,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
                                                                                                                        &(plannode->sort.nullsFirst[nPresortedCols]),
                                                                                                                        work_mem,
                                                                                                                        NULL,
-                                                                                                                       TUPLESORT_NONE);
+                                                                                                                       node->bounded ? TUPLESORT_ALLOWBOUNDED : TUPLESORT_NONE);
                node->prefixsort_state = prefixsort_state;
        }
        else
@@ -616,7 +616,9 @@ ExecIncrementalSort(PlanState *pstate)
                                                                                                                          plannode->sort.nullsFirst,
                                                                                                                          work_mem,
                                                                                                                          NULL,
-                                                                                                                         TUPLESORT_NONE);
+                                                                                                                         node->bounded ?
+                                                                                                                                       TUPLESORT_ALLOWBOUNDED :
+                                                                                                                                       TUPLESORT_NONE);
                        node->fullsort_state = fullsort_state;
                }
                else
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index a113d73795..3c28d60c3e 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -99,6 +99,8 @@ ExecSort(PlanState *pstate)
 
                if (node->randomAccess)
                        tuplesortopts |= TUPLESORT_RANDOMACCESS;
+               if (node->bounded)
+                       tuplesortopts |= TUPLESORT_ALLOWBOUNDED;
 
                if (node->datumSort)
                        tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 5cc1328feb..e44a3560de 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -842,11 +842,21 @@ tuplesort_begin_batch(Tuplesortstate *state)
         * eases memory management.  Resetting at key points reduces
         * fragmentation. Note that the memtuples array of SortTuples is allocated
         * in the parent context, not this context, because there is no need to
-        * free memtuples early.
+        * free memtuples early.  For bounded sorts, tuples may be pfreed in any
+        * order, so we use a regular aset.c context so that it can make use of
+        * free'd memory.  When the sort is not bounded, we make use of a
+        * generation.c context as this keeps allocations more compact with less
+        * wastage.  Allocations are also slightly more CPU efficient.
         */
-       state->tuplecontext = AllocSetContextCreate(state->sortcontext,
-                                                                                               "Caller tuples",
-                                                                                               ALLOCSET_DEFAULT_SIZES);
+       if (state->sortopt & TUPLESORT_ALLOWBOUNDED)
+               state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+                                                                                                       "Caller tuples",
+                                                                                                       ALLOCSET_DEFAULT_SIZES);
+       else
+               state->tuplecontext = GenerationContextCreate(state->sortcontext,
+                                                                                                         "Caller tuples",
+                                                                                                         ALLOCSET_DEFAULT_SIZES);
+
 
        state->status = TSS_INITIAL;
        state->bounded = false;
@@ -1337,6 +1347,8 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 {
        /* Assert we're called before loading any tuples */
        Assert(state->status == TSS_INITIAL && state->memtupcount == 0);
+       /* Assert we allow bounded sorts */
+       Assert(state->sortopt & TUPLESORT_ALLOWBOUNDED);
        /* Can't set the bound twice, either */
        Assert(!state->bounded);
        /* Also, this shouldn't be called in a parallel worker */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 345f4ce802..364cf132fc 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -92,6 +92,9 @@ typedef enum
 /* specifies whether non-sequential access to the sort result is required */
 #define        TUPLESORT_RANDOMACCESS                  (1 << 0)
 
+/* specifies if the tuplesort is able to support bounded sorts */
+#define TUPLESORT_ALLOWBOUNDED                 (1 << 1)
+
 typedef struct TuplesortInstrumentation
 {
        TuplesortMethod sortMethod; /* sort algorithm used */
-- 
2.32.0
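
Putting 0002 and 0003 together from a caller's perspective, the flow looks
roughly like the following sketch, modelled on nodeSort.c (field names follow
that file; treat this as illustrative rather than a copy of the patch):

    int sortopt = TUPLESORT_NONE;

    /*
     * Declare up front that the sort may be bounded: the tuple memory
     * context (aset.c vs. generation.c) is chosen inside tuplesort_begin_*,
     * before tuplesort_set_bound() can ever be called.
     */
    if (node->bounded)
        sortopt |= TUPLESORT_ALLOWBOUNDED;
    if (node->randomAccess)
        sortopt |= TUPLESORT_RANDOMACCESS;

    tuplesortstate = tuplesort_begin_heap(tupDesc,
                                          plannode->numCols,
                                          plannode->sortColIdx,
                                          plannode->sortOperators,
                                          plannode->collations,
                                          plannode->nullsFirst,
                                          work_mem,
                                          NULL, sortopt);

    /* Only valid when TUPLESORT_ALLOWBOUNDED was passed above; 0003 adds
     * an Assert in tuplesort_set_bound() to catch callers that forget. */
    if (node->bounded)
        tuplesort_set_bound(tuplesortstate, node->bound);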
