On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote:
> 0002:
> This modifies the tuplesort API so that instead of having a
> randomAccess bool flag,

Per cirrus, this missed updating one line for dtrace.

On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote:
> I feel this is a pretty simple patch and if nobody has any objections
> then I plan to push all 3 of them on my New Zealand Monday morning.

I'll mark the CF as such.

-- 
Justin
>From a6271ac626b561cdafe60152eb4fb208e7fe7240 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrow...@gmail.com>
Date: Wed, 16 Feb 2022 10:23:32 +1300
Subject: [PATCH 1/4] Improve the generation memory allocator

Here we make a series of improvements to the generation memory
allocator:

1. Allow generation contexts to have minimum, initial and maximum block
sizes. The standard aset.c allocator already allows this, but when the
generation context was added, it only allowed fixed-size blocks.  The
problem with fixed-size blocks is that it's difficult to choose how large
to make them: if the chosen size is too small, we end up with a large
number of blocks and a large number of malloc calls; if it's made too
large, memory is wasted.

2. Add support for "keeper" blocks.  This is a special block that is
allocated along with the context itself but is never freed.  Instead,
when the last chunk in the keeper block is freed, we simply mark the block
as empty to allow new allocations to make use of it.

3. Add a facility to "recycle" newly empty blocks instead of freeing them
and having to malloc an entire new block again later.  We do this by
recording a single GenerationBlock that has become empty of chunks.
When we run out of space in the current block, we check to see if there is
a "freeblock" and use it if it contains enough space for the allocation.

Author: David Rowley, Tomas Vondra
Discussion: https://postgr.es/m/d987fd54-01f8-0f73-af6c-519f799a0...@enterprisedb.com
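
To illustrate the new API, here is a minimal sketch of a call site using
the three block-size parameters (the context name and the sizes below are
purely illustrative, not values suggested by the patch):

#include "postgres.h"
#include "utils/memutils.h"

/*
 * The context is created with a roughly 4kB keeper block; later blocks
 * double in size with each new block until they reach the 1MB cap.  All
 * sizes must be MAXALIGN'd and at least 1kB, per the Asserts in
 * GenerationContextCreate().
 */
static MemoryContext
create_example_generation_context(void)
{
	return GenerationContextCreate(CurrentMemoryContext,
								   "example generation context",
								   4 * 1024,		/* minContextSize */
								   4 * 1024,		/* initBlockSize */
								   1024 * 1024);	/* maxBlockSize */
}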
---
 src/backend/access/gist/gistvacuum.c          |   6 +
 .../replication/logical/reorderbuffer.c       |   7 +
 src/backend/utils/mmgr/generation.c           | 383 ++++++++++++++----
 src/include/utils/memutils.h                  |   4 +-
 4 files changed, 322 insertions(+), 78 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index aac4afab8ff..f190decdff2 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -158,9 +158,15 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * pages in page_set_context.  Internally, the integer set will remember
 	 * this context so that the subsequent allocations for these integer sets
 	 * will be done from the same context.
+	 *
+	 * XXX the allocation sizes used below pre-date generation context's block
+	 * growing code.  These values should likely be benchmarked and set to
+	 * more suitable values.
 	 */
 	vstate.page_set_context = GenerationContextCreate(CurrentMemoryContext,
 													  "GiST VACUUM page set context",
+													  16 * 1024,
+													  16 * 1024,
 													  16 * 1024);
 	oldctx = MemoryContextSwitchTo(vstate.page_set_context);
 	vstate.internal_page_set = intset_create();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c2d9be81fae..4702750a2e7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -370,8 +370,15 @@ ReorderBufferAllocate(void)
 											SLAB_DEFAULT_BLOCK_SIZE,
 											sizeof(ReorderBufferTXN));
 
+	/*
+	 * XXX the allocation sizes used below pre-date generation context's block
+	 * growing code.  These values should likely be benchmarked and set to
+	 * more suitable values.
+	 */
 	buffer->tup_context = GenerationContextCreate(new_ctx,
 												  "Tuples",
+												  SLAB_LARGE_BLOCK_SIZE,
+												  SLAB_LARGE_BLOCK_SIZE,
 												  SLAB_LARGE_BLOCK_SIZE);
 
 	hash_ctl.keysize = sizeof(TransactionId);
diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c
index 95c006f689f..29065201556 100644
--- a/src/backend/utils/mmgr/generation.c
+++ b/src/backend/utils/mmgr/generation.c
@@ -20,18 +20,15 @@
  *
  *	The memory context uses a very simple approach to free space management.
  *	Instead of a complex global freelist, each block tracks a number
- *	of allocated and freed chunks. Freed chunks are not reused, and once all
- *	chunks in a block are freed, the whole block is thrown away. When the
- *	chunks allocated in the same block have similar lifespan, this works
- *	very well and is very cheap.
+ *	of allocated and freed chunks.  The block is classed as empty when the
+ *	number of free chunks is equal to the number of allocated chunks.  When
+ *	this occurs, instead of freeing the block, we try to "recycle" it, i.e.
+ *	reuse it for new allocations.  This is done by setting the block in the
+ *	context's 'freeblock' field.  If the freeblock field is already occupied
+ *	by another free block we simply return the newly empty block to malloc.
  *
- *	The current implementation only uses a fixed block size - maybe it should
- *	adapt a min/max block size range, and grow the blocks automatically.
- *	It already uses dedicated blocks for oversized chunks.
- *
- *	XXX It might be possible to improve this by keeping a small freelist for
- *	only a small number of recent blocks, but it's not clear it's worth the
- *	additional complexity.
+ *	This approach to free blocks requires fewer malloc/free calls for truly
+ *	first-allocated, first-freed allocation patterns.
  *
  *-------------------------------------------------------------------------
  */
@@ -39,6 +36,7 @@
 #include "postgres.h"
 
 #include "lib/ilist.h"
+#include "port/pg_bitutils.h"
 #include "utils/memdebug.h"
 #include "utils/memutils.h"
 
@@ -46,6 +44,8 @@
 #define Generation_BLOCKHDRSZ	MAXALIGN(sizeof(GenerationBlock))
 #define Generation_CHUNKHDRSZ	sizeof(GenerationChunk)
 
+#define Generation_CHUNK_FRACTION	8
+
 typedef struct GenerationBlock GenerationBlock; /* forward reference */
 typedef struct GenerationChunk GenerationChunk;
 
@@ -60,20 +60,28 @@ typedef struct GenerationContext
 	MemoryContextData header;	/* Standard memory-context fields */
 
 	/* Generational context parameters */
-	Size		blockSize;		/* standard block size */
-
-	GenerationBlock *block;		/* current (most recently allocated) block */
+	Size		initBlockSize;	/* initial block size */
+	Size		maxBlockSize;	/* maximum block size */
+	Size		nextBlockSize;	/* next block size to allocate */
+	Size		allocChunkLimit;	/* effective chunk size limit */
+
+	GenerationBlock *block;		/* current (most recently allocated) block, or
+								 * NULL if we've just freed the most recent
+								 * block */
+	GenerationBlock *freeblock;	/* pointer to a block that's being recycled,
+								 * or NULL if there's no such block. */
+	GenerationBlock *keeper;	/* keep this block over resets */
 	dlist_head	blocks;			/* list of blocks */
 } GenerationContext;
 
 /*
  * GenerationBlock
  *		GenerationBlock is the unit of memory that is obtained by generation.c
- *		from malloc().  It contains one or more GenerationChunks, which are
+ *		from malloc().  It contains zero or more GenerationChunks, which are
  *		the units requested by palloc() and freed by pfree().  GenerationChunks
  *		cannot be returned to malloc() individually, instead pfree()
  *		updates the free counter of the block and when all chunks in a block
- *		are free the whole block is returned to malloc().
+ *		are free the whole block can be returned to malloc().
  *
  *		GenerationBlock is the header data for a block --- the usable space
  *		within the block begins at the next alignment boundary.
@@ -143,6 +151,14 @@ struct GenerationChunk
 #define GenerationChunkGetPointer(chk) \
 	((GenerationPointer *)(((char *)(chk)) + Generation_CHUNKHDRSZ))
 
+/* Inlined helper functions */
+static inline void GenerationBlockInit(GenerationBlock *block, Size blksize);
+static inline bool GenerationBlockIsEmpty(GenerationBlock *block);
+static inline void GenerationBlockMarkEmpty(GenerationBlock *block);
+static inline Size GenerationBlockFreeBytes(GenerationBlock *block);
+static inline void GenerationBlockFree(GenerationContext *set,
+									   GenerationBlock *block);
+
 /*
  * These functions implement the MemoryContext API for Generation contexts.
  */
@@ -191,14 +207,21 @@ static const MemoryContextMethods GenerationMethods = {
  *
  * parent: parent context, or NULL if top-level context
  * name: name of context (must be statically allocated)
- * blockSize: generation block size
+ * minContextSize: minimum context size
+ * initBlockSize: initial allocation block size
+ * maxBlockSize: maximum allocation block size
  */
 MemoryContext
 GenerationContextCreate(MemoryContext parent,
 						const char *name,
-						Size blockSize)
+						Size minContextSize,
+						Size initBlockSize,
+						Size maxBlockSize)
 {
+	Size		firstBlockSize;
+	Size		allocSize;
 	GenerationContext *set;
+	GenerationBlock	   *block;
 
 	/* Assert we padded GenerationChunk properly */
 	StaticAssertStmt(Generation_CHUNKHDRSZ == MAXALIGN(Generation_CHUNKHDRSZ),
@@ -208,24 +231,33 @@ GenerationContextCreate(MemoryContext parent,
 					 "padding calculation in GenerationChunk is wrong");
 
 	/*
-	 * First, validate allocation parameters.  (If we're going to throw an
-	 * error, we should do so before the context is created, not after.)  We
-	 * somewhat arbitrarily enforce a minimum 1K block size, mostly because
-	 * that's what AllocSet does.
+	 * First, validate allocation parameters.  Asserts seem sufficient because
+	 * nobody varies their parameters at runtime.  We somewhat arbitrarily
+	 * enforce a minimum 1K block size.
 	 */
-	if (blockSize != MAXALIGN(blockSize) ||
-		blockSize < 1024 ||
-		!AllocHugeSizeIsValid(blockSize))
-		elog(ERROR, "invalid blockSize for memory context: %zu",
-			 blockSize);
+	Assert(initBlockSize == MAXALIGN(initBlockSize) &&
+		   initBlockSize >= 1024);
+	Assert(maxBlockSize == MAXALIGN(maxBlockSize) &&
+		   maxBlockSize >= initBlockSize &&
+		   AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
+	Assert(minContextSize == 0 ||
+		   (minContextSize == MAXALIGN(minContextSize) &&
+			minContextSize >= 1024 &&
+			minContextSize <= maxBlockSize));
+
+	/* Determine size of initial block */
+	allocSize = MAXALIGN(sizeof(GenerationContext)) +
+		Generation_BLOCKHDRSZ + Generation_CHUNKHDRSZ;
+	if (minContextSize != 0)
+		allocSize = Max(allocSize, minContextSize);
+	else
+		allocSize = Max(allocSize, initBlockSize);
 
 	/*
-	 * Allocate the context header.  Unlike aset.c, we never try to combine
-	 * this with the first regular block, since that would prevent us from
-	 * freeing the first generation of allocations.
+	 * Allocate the initial block.  Unlike other generation.c blocks, it
+	 * starts with the context header and its block header follows that.
 	 */
-
-	set = (GenerationContext *) malloc(MAXALIGN(sizeof(GenerationContext)));
+	set = (GenerationContext *) malloc(allocSize);
 	if (set == NULL)
 	{
 		MemoryContextStats(TopMemoryContext);
@@ -240,11 +272,40 @@ GenerationContextCreate(MemoryContext parent,
 	 * Avoid writing code that can fail between here and MemoryContextCreate;
 	 * we'd leak the header if we ereport in this stretch.
 	 */
+	dlist_init(&set->blocks);
+
+	 /* Fill in the initial block's block header */
+	block = (GenerationBlock *) (((char *) set) + MAXALIGN(sizeof(GenerationContext)));
+	/* determine the block size and initialize it */
+	firstBlockSize = allocSize - MAXALIGN(sizeof(GenerationContext));
+	GenerationBlockInit(block, firstBlockSize);
+
+	/* add it to the doubly-linked list of blocks */
+	dlist_push_head(&set->blocks, &block->node);
+
+	/* use it as the current allocation block */
+	set->block = block;
+
+	/* No free block, yet */
+	set->freeblock = NULL;
+
+	/* Mark block as not to be released at reset time */
+	set->keeper = block;
 
 	/* Fill in GenerationContext-specific header fields */
-	set->blockSize = blockSize;
-	set->block = NULL;
-	dlist_init(&set->blocks);
+	set->initBlockSize = initBlockSize;
+	set->maxBlockSize = maxBlockSize;
+	set->nextBlockSize = initBlockSize;
+
+	/*
+	 * Compute the allocation chunk size limit for this context.
+	 *
+	 * Follows similar ideas to AllocSet; see aset.c for details ...
+	 */
+	set->allocChunkLimit = maxBlockSize;
+	while ((Size) (set->allocChunkLimit + Generation_CHUNKHDRSZ) >
+		   (Size) ((Size) (maxBlockSize - Generation_BLOCKHDRSZ) / Generation_CHUNK_FRACTION))
+		set->allocChunkLimit >>= 1;
 
 	/* Finally, do the type-independent part of context creation */
 	MemoryContextCreate((MemoryContext) set,
@@ -253,6 +314,8 @@ GenerationContextCreate(MemoryContext parent,
 						parent,
 						name);
 
+	((MemoryContext) set)->mem_allocated = firstBlockSize;
+
 	return (MemoryContext) set;
 }
 
@@ -276,24 +339,32 @@ GenerationReset(MemoryContext context)
 	GenerationCheck(context);
 #endif
 
+	/*
+	 * NULLify the free block pointer.  We must do this before calling
+	 * GenerationBlockFree as that function never expects to free the
+	 * freeblock.
+	 */
+	set->freeblock = NULL;
+
 	dlist_foreach_modify(miter, &set->blocks)
 	{
 		GenerationBlock *block = dlist_container(GenerationBlock, node, miter.cur);
 
-		dlist_delete(miter.cur);
-
-		context->mem_allocated -= block->blksize;
-
-#ifdef CLOBBER_FREED_MEMORY
-		wipe_mem(block, block->blksize);
-#endif
-
-		free(block);
+		if (block == set->keeper)
+			GenerationBlockMarkEmpty(block);
+		else
+			GenerationBlockFree(set, block);
 	}
 
-	set->block = NULL;
+	/* set it so new allocations make use of the keeper block */
+	set->block = set->keeper;
 
-	Assert(dlist_is_empty(&set->blocks));
+	/* Reset block size allocation sequence, too */
+	set->nextBlockSize = set->initBlockSize;
+
+	/* Ensure there is only 1 item in the dlist */
+	Assert(!dlist_is_empty(&set->blocks));
+	Assert(!dlist_has_next(&set->blocks, dlist_head_node(&set->blocks)));
 }
 
 /*
@@ -303,9 +374,9 @@ GenerationReset(MemoryContext context)
 static void
 GenerationDelete(MemoryContext context)
 {
-	/* Reset to release all the GenerationBlocks */
+	/* Reset to release all releasable GenerationBlocks */
 	GenerationReset(context);
-	/* And free the context header */
+	/* And free the context header and keeper block */
 	free(context);
 }
 
@@ -329,11 +400,12 @@ GenerationAlloc(MemoryContext context, Size size)
 	GenerationBlock *block;
 	GenerationChunk *chunk;
 	Size		chunk_size = MAXALIGN(size);
+	Size		required_size = chunk_size + Generation_CHUNKHDRSZ;
 
 	/* is it an over-sized chunk? if yes, allocate special block */
-	if (chunk_size > set->blockSize / 8)
+	if (chunk_size > set->allocChunkLimit)
 	{
-		Size		blksize = chunk_size + Generation_BLOCKHDRSZ + Generation_CHUNKHDRSZ;
+		Size		blksize = required_size + Generation_BLOCKHDRSZ;
 
 		block = (GenerationBlock *) malloc(blksize);
 		if (block == NULL)
@@ -379,36 +451,76 @@ GenerationAlloc(MemoryContext context, Size size)
 	}
 
 	/*
-	 * Not an over-sized chunk. Is there enough space in the current block? If
-	 * not, allocate a new "regular" block.
+	 * Not an oversized chunk.  We try to first make use of the current
+	 * block, but if there's not enough space in it, instead of allocating a
+	 * new block, we look to see if the freeblock is empty and has enough
+	 * space.  If not, we'll also try the same using the keeper block.  The
+	 * keeper block may have become empty and we have no other way to reuse it
+	 * again if we don't try to use it explicitly here.
+	 *
+	 * We don't want to start filling the freeblock before the current block
+	 * is full, otherwise we may cause fragmentation in FIFO type workloads.
+	 * We only switch to using the freeblock or keeper block if those blocks
+	 * are completely empty.  If we didn't do that we could end up fragmenting
+	 * consecutive allocations over multiple blocks which would be a problem
+	 * that would compound over time.
 	 */
 	block = set->block;
 
-	if ((block == NULL) ||
-		(block->endptr - block->freeptr) < Generation_CHUNKHDRSZ + chunk_size)
+	if (block == NULL ||
+		GenerationBlockFreeBytes(block) < required_size)
 	{
-		Size		blksize = set->blockSize;
+		Size		blksize;
+		GenerationBlock *freeblock = set->freeblock;
 
-		block = (GenerationBlock *) malloc(blksize);
+		if (freeblock != NULL &&
+			GenerationBlockIsEmpty(freeblock) &&
+			GenerationBlockFreeBytes(freeblock) >= required_size)
+		{
+			block = freeblock;
 
-		if (block == NULL)
-			return NULL;
+			/* Zero out the freeblock as we'll set this to the current block below */
+			set->freeblock = NULL;
+		}
+		else if (GenerationBlockIsEmpty(set->keeper) &&
+				 GenerationBlockFreeBytes(set->keeper) >= required_size)
+		{
+			block = set->keeper;
+		}
+		else
+		{
+			/*
+			 * The first such block has size initBlockSize, and we double the
+			 * space in each succeeding block, but not more than maxBlockSize.
+			 */
+			blksize = set->nextBlockSize;
+			set->nextBlockSize <<= 1;
+			if (set->nextBlockSize > set->maxBlockSize)
+				set->nextBlockSize = set->maxBlockSize;
 
-		context->mem_allocated += blksize;
+			/* we'll need a block hdr too, so add that to the required size */
+			required_size += Generation_BLOCKHDRSZ;
 
-		block->blksize = blksize;
-		block->nchunks = 0;
-		block->nfree = 0;
+			/* round the size up to the next power of 2 */
+			if (blksize < required_size)
+				blksize = pg_nextpower2_size_t(required_size);
 
-		block->freeptr = ((char *) block) + Generation_BLOCKHDRSZ;
-		block->endptr = ((char *) block) + blksize;
+			block = (GenerationBlock *) malloc(blksize);
 
-		/* Mark unallocated space NOACCESS. */
-		VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
-								   blksize - Generation_BLOCKHDRSZ);
+			if (block == NULL)
+				return NULL;
 
-		/* add it to the doubly-linked list of blocks */
-		dlist_push_head(&set->blocks, &block->node);
+			context->mem_allocated += blksize;
+
+			/* initialize the new block */
+			GenerationBlockInit(block, blksize);
+
+			/* add it to the doubly-linked list of blocks */
+			dlist_push_head(&set->blocks, &block->node);
+
+			/* Zero out the freeblock in case it's become full */
+			set->freeblock = NULL;
+		}
 
 		/* and also use it as the current allocation block */
 		set->block = block;
@@ -453,6 +565,95 @@ GenerationAlloc(MemoryContext context, Size size)
 	return GenerationChunkGetPointer(chunk);
 }
 
+/*
+ * GenerationBlockInit
+ *		Initializes 'block' assuming 'blksize'.  Does not update the context's
+ *		mem_allocated field.
+ */
+static inline void
+GenerationBlockInit(GenerationBlock *block, Size blksize)
+{
+	block->blksize = blksize;
+	block->nchunks = 0;
+	block->nfree = 0;
+
+	block->freeptr = ((char *) block) + Generation_BLOCKHDRSZ;
+	block->endptr = ((char *) block) + blksize;
+
+	/* Mark unallocated space NOACCESS. */
+	VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
+							   blksize - Generation_BLOCKHDRSZ);
+}
+
+/*
+ * GenerationBlockIsEmpty
+ *		Returns true iff 'block' contains no chunks
+ */
+static inline bool
+GenerationBlockIsEmpty(GenerationBlock *block)
+{
+	return (block->nchunks == 0);
+}
+
+/*
+ * GenerationBlockMarkEmpty
+ *		Set a block as empty.  Does not free the block.
+ */
+static inline void
+GenerationBlockMarkEmpty(GenerationBlock *block)
+{
+#if defined(USE_VALGRIND) || defined(CLOBBER_FREED_MEMORY)
+	char	   *datastart = ((char *) block) + Generation_BLOCKHDRSZ;
+#endif
+
+#ifdef CLOBBER_FREED_MEMORY
+	wipe_mem(datastart, block->freeptr - datastart);
+#else
+	/* wipe_mem() would have done this */
+	VALGRIND_MAKE_MEM_NOACCESS(datastart, block->freeptr - datastart);
+#endif
+
+	/* Reset the block, but don't return it to malloc */
+	block->nchunks = 0;
+	block->nfree = 0;
+	block->freeptr = ((char *) block) + Generation_BLOCKHDRSZ;
+
+}
+
+/*
+ * GenerationBlockFreeBytes
+ *		Returns the number of bytes free in 'block'
+ */
+static inline Size
+GenerationBlockFreeBytes(GenerationBlock *block)
+{
+	return (block->endptr - block->freeptr);
+}
+
+/*
+ * GenerationBlockFree
+ *		Remove 'block' from 'context' and release the memory consumed by it.
+ */
+static inline void
+GenerationBlockFree(GenerationContext *set, GenerationBlock *block)
+{
+	/* Make sure nobody tries to free the keeper block */
+	Assert(block != set->keeper);
+	/* We shouldn't be freeing the freeblock either */
+	Assert(block != set->freeblock);
+
+	/* release the block from the list of blocks */
+	dlist_delete(&block->node);
+
+	((MemoryContext) set)->mem_allocated -= block->blksize;
+
+#ifdef CLOBBER_FREED_MEMORY
+	wipe_mem(block, block->blksize);
+#endif
+
+	free(block);
+}
+
 /*
  * GenerationFree
  *		Update number of chunks in the block, and if all chunks in the block
@@ -499,16 +700,36 @@ GenerationFree(MemoryContext context, void *pointer)
 	if (block->nfree < block->nchunks)
 		return;
 
+	/* Don't try to free the keeper block, just mark it empty */
+	if (block == set->keeper)
+	{
+		GenerationBlockMarkEmpty(block);
+		return;
+	}
+
 	/*
-	 * The block is empty, so let's get rid of it. First remove it from the
-	 * list of blocks, then return it to malloc().
+	 * If there is no freeblock set or if this is the freeblock then instead
+	 * of freeing this memory, we keep it around so that new allocations have
+	 * the option of recycling it.
 	 */
-	dlist_delete(&block->node);
+	if (set->freeblock == NULL || set->freeblock == block)
+	{
+		/* XXX should we only recycle maxBlockSize sized blocks? */
+		set->freeblock = block;
+		GenerationBlockMarkEmpty(block);
+		return;
+	}
 
 	/* Also make sure the block is not marked as the current block. */
 	if (set->block == block)
 		set->block = NULL;
 
+	/*
+	 * The block is empty, so let's get rid of it. First remove it from the
+	 * list of blocks, then return it to malloc().
+	 */
+	dlist_delete(&block->node);
+
 	context->mem_allocated -= block->blksize;
 	free(block);
 }
@@ -655,8 +876,17 @@ static bool
 GenerationIsEmpty(MemoryContext context)
 {
 	GenerationContext *set = (GenerationContext *) context;
+	dlist_iter	iter;
 
-	return dlist_is_empty(&set->blocks);
+	dlist_foreach(iter, &set->blocks)
+	{
+		GenerationBlock *block = dlist_container(GenerationBlock, node, iter.cur);
+
+		if (block->nchunks > 0)
+			return false;
+	}
+
+	return true;
 }
 
 /*
@@ -746,12 +976,11 @@ GenerationCheck(MemoryContext context)
 		char	   *ptr;
 
 		total_allocated += block->blksize;
-
 		/*
-		 * nfree > nchunks is surely wrong, and we don't expect to see
-		 * equality either, because such a block should have gotten freed.
+		 * nfree > nchunks is surely wrong.  Equality is allowed as the block
+		 * might be completely empty if it's the freeblock.
 		 */
-		if (block->nfree >= block->nchunks)
+		if (block->nfree > block->nchunks)
 			elog(WARNING, "problem in Generation %s: number of free chunks %d in block %p exceeds %d allocated",
 				 name, block->nfree, block, block->nchunks);
 
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index 019700247a1..495d1af2010 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -183,7 +183,9 @@ extern MemoryContext SlabContextCreate(MemoryContext parent,
 /* generation.c */
 extern MemoryContext GenerationContextCreate(MemoryContext parent,
 											 const char *name,
-											 Size blockSize);
+											 Size minContextSize,
+											 Size initBlockSize,
+											 Size maxBlockSize);
 
 /*
  * Recommended default alloc parameters, suitable for "ordinary" contexts
-- 
2.17.1

>From 899ea66ad28b78aabd8596c35c1bff309709cd1c Mon Sep 17 00:00:00 2001
From: David Rowley <dgrow...@gmail.com>
Date: Fri, 1 Apr 2022 20:04:58 +1300
Subject: [PATCH 2/4] Adjust tuplesort API to have bitwise flags instead of
 bool

Until now we only had a single bool option for randomAccess; however, an
upcoming patch requires adding another option.  Instead of breaking the
API for that and then breaking it again one day if we add more options,
let's just break it once.  Any options we add in the future will just
make use of an unused bit in the flags.

Author: David Rowley
Discussion: https://postgr.es/m/CAApHDvoH4ASzsAOyHcxkuY01Qf%2B%2B8JJ0paw%2B03dk%2BW25tQEcNQ%40mail.gmail.com
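
As a sketch of the new calling convention (assuming the usual tuplesort
setup variables -- tupDesc, nkeys, attNums, sortOperators, sortCollations
and nullsFirstFlags -- are already in scope at the call site):

	Tuplesortstate *state;
	int			sortopt = TUPLESORT_NONE;

	/* request non-sequential access to the result only when needed */
	if (need_random_access)		/* hypothetical caller-side condition */
		sortopt |= TUPLESORT_RANDOMACCESS;

	state = tuplesort_begin_heap(tupDesc, nkeys, attNums,
								 sortOperators, sortCollations,
								 nullsFirstFlags,
								 work_mem, NULL, sortopt);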
---
 src/backend/access/gist/gistbuild.c        |   2 +-
 src/backend/access/hash/hashsort.c         |   2 +-
 src/backend/access/heap/heapam_handler.c   |   2 +-
 src/backend/access/nbtree/nbtsort.c        |   6 +-
 src/backend/catalog/index.c                |   2 +-
 src/backend/executor/nodeAgg.c             |   6 +-
 src/backend/executor/nodeIncrementalSort.c |   4 +-
 src/backend/executor/nodeSort.c            |   8 +-
 src/backend/utils/adt/orderedsetaggs.c     |  10 +-
 src/backend/utils/sort/tuplesort.c         | 101 +++++++++++----------
 src/include/utils/tuplesort.h              |  19 ++--
 11 files changed, 90 insertions(+), 72 deletions(-)

diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index e081e6571a4..f5a5caff8ec 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -271,7 +271,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 														  index,
 														  maintenance_work_mem,
 														  NULL,
-														  false);
+														  TUPLESORT_NONE);
 
 		/* Scan the table, adding all tuples to the tuplesort */
 		reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 6d8512283a8..aa61e39f26a 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -86,7 +86,7 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 												   hspool->max_buckets,
 												   maintenance_work_mem,
 												   NULL,
-												   false);
+												   TUPLESORT_NONE);
 
 	return hspool;
 }
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dee264e8596..3a9532cb4f7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -726,7 +726,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	if (use_sort)
 		tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
 											maintenance_work_mem,
-											NULL, false);
+											NULL, TUPLESORT_NONE);
 	else
 		tuplesort = NULL;
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8a19de2f66c..15605dac6bb 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -436,7 +436,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
 									buildstate->nulls_not_distinct,
 									maintenance_work_mem, coordinate,
-									false);
+									TUPLESORT_NONE);
 
 	/*
 	 * If building a unique index, put dead tuples in a second spool to keep
@@ -475,7 +475,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 */
 		buildstate->spool2->sortstate =
 			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
-										coordinate2, false);
+										coordinate2, TUPLESORT_NONE);
 	}
 
 	/* Fill spool using either serial or parallel heap scan */
@@ -1939,7 +1939,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
 													 sortmem, coordinate,
-													 false);
+													 TUPLESORT_NONE);
 
 	/*
 	 * Just as with serial case, there may be a second spool.  If so, a
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index dd715ca0609..55800c94786 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3364,7 +3364,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
 											maintenance_work_mem,
-											NULL, false);
+											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
 	/* ambulkdelete updates progress metrics */
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 08cf569d8fa..23030a32a59 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -530,7 +530,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
-												  NULL, false);
+												  NULL, TUPLESORT_NONE);
 	}
 
 	aggstate->current_phase = newphase;
@@ -607,7 +607,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									  pertrans->sortOperators[0],
 									  pertrans->sortCollations[0],
 									  pertrans->sortNullsFirst[0],
-									  work_mem, NULL, false);
+									  work_mem, NULL, TUPLESORT_NONE);
 		}
 		else
 			pertrans->sortstates[aggstate->current_set] =
@@ -617,7 +617,7 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, NULL, false);
+									 work_mem, NULL, TUPLESORT_NONE);
 	}
 
 	/*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index d6fb56dec73..4f50bc845da 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -315,7 +315,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 												&(plannode->sort.nullsFirst[nPresortedCols]),
 												work_mem,
 												NULL,
-												false);
+												TUPLESORT_NONE);
 		node->prefixsort_state = prefixsort_state;
 	}
 	else
@@ -616,7 +616,7 @@ ExecIncrementalSort(PlanState *pstate)
 												  plannode->sort.nullsFirst,
 												  work_mem,
 												  NULL,
-												  false);
+												  TUPLESORT_NONE);
 			node->fullsort_state = fullsort_state;
 		}
 		else
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 9481a622bf5..a113d737955 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -77,6 +77,7 @@ ExecSort(PlanState *pstate)
 		Sort	   *plannode = (Sort *) node->ss.ps.plan;
 		PlanState  *outerNode;
 		TupleDesc	tupDesc;
+		int			tuplesortopts = TUPLESORT_NONE;
 
 		SO1_printf("ExecSort: %s\n",
 				   "sorting subplan");
@@ -96,6 +97,9 @@ ExecSort(PlanState *pstate)
 		outerNode = outerPlanState(node);
 		tupDesc = ExecGetResultType(outerNode);
 
+		if (node->randomAccess)
+			tuplesortopts |= TUPLESORT_RANDOMACCESS;
+
 		if (node->datumSort)
 			tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
 												   plannode->sortOperators[0],
@@ -103,7 +107,7 @@ ExecSort(PlanState *pstate)
 												   plannode->nullsFirst[0],
 												   work_mem,
 												   NULL,
-												   node->randomAccess);
+												   tuplesortopts);
 		else
 			tuplesortstate = tuplesort_begin_heap(tupDesc,
 												  plannode->numCols,
@@ -113,7 +117,7 @@ ExecSort(PlanState *pstate)
 												  plannode->nullsFirst,
 												  work_mem,
 												  NULL,
-												  node->randomAccess);
+												  tuplesortopts);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
 		node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 96dae6ec4a8..6d4f6b7dca2 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -118,6 +118,7 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	OSAPerQueryState *qstate;
 	MemoryContext gcontext;
 	MemoryContext oldcontext;
+	int			tuplesortopt;
 
 	/*
 	 * Check we're called as aggregate (and not a window function), and get
@@ -283,6 +284,11 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	osastate->qstate = qstate;
 	osastate->gcontext = gcontext;
 
+	tuplesortopt = TUPLESORT_NONE;
+
+	if (qstate->rescan_needed)
+		tuplesortopt |= TUPLESORT_RANDOMACCESS;
+
 	/*
 	 * Initialize tuplesort object.
 	 */
@@ -295,7 +301,7 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortNullsFirsts,
 												   work_mem,
 												   NULL,
-												   qstate->rescan_needed);
+												   tuplesortopt);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
@@ -303,7 +309,7 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 													qstate->sortNullsFirst,
 													work_mem,
 													NULL,
-													qstate->rescan_needed);
+													tuplesortopt);
 
 	osastate->number_of_rows = 0;
 	osastate->sort_done = false;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 086e948fca6..5cc1328febf 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -246,7 +246,7 @@ struct Tuplesortstate
 {
 	TupSortStatus status;		/* enumerated value as shown above */
 	int			nKeys;			/* number of columns in sort key */
-	bool		randomAccess;	/* did caller request random access? */
+	int			sortopt;		/* Bitmask of flags used to setup sort */
 	bool		bounded;		/* did caller specify a maximum number of
 								 * tuples to return? */
 	bool		boundUsed;		/* true if we made use of a bounded heap */
@@ -558,12 +558,12 @@ struct Sharedsort
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
- * If state->randomAccess is true, then the stored representation of the
- * tuple must be followed by another "unsigned int" that is a copy of the
- * length --- so the total tape space used is actually sizeof(unsigned int)
- * more than the stored length value.  This allows read-backwards.  When
- * randomAccess is not true, the write/read routines may omit the extra
- * length word.
+ * If state->sortopt contains TUPLESORT_RANDOMACCESS, then the stored
+ * representation of the tuple must be followed by another "unsigned int" that
+ * is a copy of the length --- so the total tape space used is actually
+ * sizeof(unsigned int) more than the stored length value.  This allows
+ * read-backwards.  When the random access flag was not specified, the
+ * write/read routines may omit the extra length word.
  *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
@@ -608,7 +608,7 @@ struct Sharedsort
 
 static Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
-											  bool randomAccess);
+											  int sortopt);
 static void tuplesort_begin_batch(Tuplesortstate *state);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
@@ -713,21 +713,20 @@ static void tuplesort_updatemax(Tuplesortstate *state);
  * Each variant of tuplesort_begin has a workMem parameter specifying the
  * maximum number of kilobytes of RAM to use before spilling data to disk.
  * (The normal value of this parameter is work_mem, but some callers use
- * other values.)  Each variant also has a randomAccess parameter specifying
- * whether the caller needs non-sequential access to the sort result.
+ * other values.)  Each variant also has a sortopt which is a bitmask of
+ * sort options.  See TUPLESORT_* definitions in tuplesort.h
  */
 
 static Tuplesortstate *
-tuplesort_begin_common(int workMem, SortCoordinate coordinate,
-					   bool randomAccess)
+tuplesort_begin_common(int workMem, SortCoordinate coordinate, int sortopt)
 {
 	Tuplesortstate *state;
 	MemoryContext maincontext;
 	MemoryContext sortcontext;
 	MemoryContext oldcontext;
 
-	/* See leader_takeover_tapes() remarks on randomAccess support */
-	if (coordinate && randomAccess)
+	/* See leader_takeover_tapes() remarks on random access support */
+	if (coordinate && (sortopt & TUPLESORT_RANDOMACCESS))
 		elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
@@ -764,7 +763,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
 		pg_rusage_init(&state->ru_start);
 #endif
 
-	state->randomAccess = randomAccess;
+	state->sortopt = sortopt;
 	state->tuples = true;
 
 	/*
@@ -898,10 +897,10 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, SortCoordinate coordinate, bool randomAccess)
+					 int workMem, SortCoordinate coordinate, int sortopt)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
-												   randomAccess);
+												   sortopt);
 	MemoryContext oldcontext;
 	int			i;
 
@@ -913,7 +912,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	if (trace_sort)
 		elog(LOG,
 			 "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
-			 nkeys, workMem, randomAccess ? 't' : 'f');
+			 nkeys, workMem, sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
 #endif
 
 	state->nKeys = nkeys;
@@ -922,7 +921,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 								false,	/* no unique check */
 								nkeys,
 								workMem,
-								randomAccess,
+								sortopt & TUPLESORT_RANDOMACCESS,
 								PARALLEL_SORT(state));
 
 	state->comparetup = comparetup_heap;
@@ -971,10 +970,10 @@ Tuplesortstate *
 tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel,
 						int workMem,
-						SortCoordinate coordinate, bool randomAccess)
+						SortCoordinate coordinate, int sortopt)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
-												   randomAccess);
+												   sortopt);
 	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
@@ -988,7 +987,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		elog(LOG,
 			 "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
 			 RelationGetNumberOfAttributes(indexRel),
-			 workMem, randomAccess ? 't' : 'f');
+			 workMem, sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
 #endif
 
 	state->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
@@ -997,7 +996,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 								false,	/* no unique check */
 								state->nKeys,
 								workMem,
-								randomAccess,
+								sortopt & TUPLESORT_RANDOMACCESS,
 								PARALLEL_SORT(state));
 
 	state->comparetup = comparetup_cluster;
@@ -1069,10 +1068,10 @@ tuplesort_begin_index_btree(Relation heapRel,
 							bool uniqueNullsNotDistinct,
 							int workMem,
 							SortCoordinate coordinate,
-							bool randomAccess)
+							int sortopt)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
-												   randomAccess);
+												   sortopt);
 	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
@@ -1084,7 +1083,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		elog(LOG,
 			 "begin index sort: unique = %c, workMem = %d, randomAccess = %c",
 			 enforceUnique ? 't' : 'f',
-			 workMem, randomAccess ? 't' : 'f');
+			 workMem, sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
 #endif
 
 	state->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
@@ -1093,7 +1092,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 								enforceUnique,
 								state->nKeys,
 								workMem,
-								randomAccess,
+								sortopt & TUPLESORT_RANDOMACCESS,
 								PARALLEL_SORT(state));
 
 	state->comparetup = comparetup_index_btree;
@@ -1150,10 +1149,10 @@ tuplesort_begin_index_hash(Relation heapRel,
 						   uint32 max_buckets,
 						   int workMem,
 						   SortCoordinate coordinate,
-						   bool randomAccess)
+						   int sortopt)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
-												   randomAccess);
+												   sortopt);
 	MemoryContext oldcontext;
 
 	oldcontext = MemoryContextSwitchTo(state->maincontext);
@@ -1166,7 +1165,8 @@ tuplesort_begin_index_hash(Relation heapRel,
 			 high_mask,
 			 low_mask,
 			 max_buckets,
-			 workMem, randomAccess ? 't' : 'f');
+			 workMem,
+			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
 #endif
 
 	state->nKeys = 1;			/* Only one sort column, the hash code */
@@ -1193,10 +1193,10 @@ tuplesort_begin_index_gist(Relation heapRel,
 						   Relation indexRel,
 						   int workMem,
 						   SortCoordinate coordinate,
-						   bool randomAccess)
+						   int sortopt)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
-												   randomAccess);
+												   sortopt);
 	MemoryContext oldcontext;
 	int			i;
 
@@ -1206,7 +1206,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	if (trace_sort)
 		elog(LOG,
 			 "begin index sort: workMem = %d, randomAccess = %c",
-			 workMem, randomAccess ? 't' : 'f');
+			 workMem, sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
 #endif
 
 	state->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
@@ -1248,10 +1248,10 @@ tuplesort_begin_index_gist(Relation heapRel,
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag, int workMem,
-					  SortCoordinate coordinate, bool randomAccess)
+					  SortCoordinate coordinate, int sortopt)
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
-												   randomAccess);
+												   sortopt);
 	MemoryContext oldcontext;
 	int16		typlen;
 	bool		typbyval;
@@ -1262,7 +1262,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	if (trace_sort)
 		elog(LOG,
 			 "begin datum sort: workMem = %d, randomAccess = %c",
-			 workMem, randomAccess ? 't' : 'f');
+			 workMem, sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
 #endif
 
 	state->nKeys = 1;			/* always a one-column sort */
@@ -2165,7 +2165,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 	switch (state->status)
 	{
 		case TSS_SORTEDINMEM:
-			Assert(forward || state->randomAccess);
+			Assert(forward || state->sortopt & TUPLESORT_RANDOMACCESS);
 			Assert(!state->slabAllocatorUsed);
 			if (forward)
 			{
@@ -2209,7 +2209,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 			break;
 
 		case TSS_SORTEDONTAPE:
-			Assert(forward || state->randomAccess);
+			Assert(forward || state->sortopt & TUPLESORT_RANDOMACCESS);
 			Assert(state->slabAllocatorUsed);
 
 			/*
@@ -2984,7 +2984,8 @@ mergeruns(Tuplesortstate *state)
 			 * sorted tape, we can stop at this point and do the final merge
 			 * on-the-fly.
 			 */
-			if (!state->randomAccess && state->nInputRuns <= state->nInputTapes
+			if ((state->sortopt & TUPLESORT_RANDOMACCESS) == 0
+				&& state->nInputRuns <= state->nInputTapes
 				&& !WORKER(state))
 			{
 				/* Tell logtape.c we won't be writing anymore */
@@ -3230,7 +3231,7 @@ tuplesort_rescan(Tuplesortstate *state)
 {
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 
-	Assert(state->randomAccess);
+	Assert(state->sortopt & TUPLESORT_RANDOMACCESS);
 
 	switch (state->status)
 	{
@@ -3263,7 +3264,7 @@ tuplesort_markpos(Tuplesortstate *state)
 {
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 
-	Assert(state->randomAccess);
+	Assert(state->sortopt & TUPLESORT_RANDOMACCESS);
 
 	switch (state->status)
 	{
@@ -3294,7 +3295,7 @@ tuplesort_restorepos(Tuplesortstate *state)
 {
 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
 
-	Assert(state->randomAccess);
+	Assert(state->sortopt & TUPLESORT_RANDOMACCESS);
 
 	switch (state->status)
 	{
@@ -3853,7 +3854,7 @@ writetup_heap(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
 
 	LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen));
 	LogicalTapeWrite(tape, (void *) tupbody, tupbodylen);
-	if (state->randomAccess)	/* need trailing length word? */
+	if (state->sortopt & TUPLESORT_RANDOMACCESS)	/* need trailing length word? */
 		LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen));
 
 	if (!state->slabAllocatorUsed)
@@ -3876,7 +3877,7 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 	/* read in the tuple proper */
 	tuple->t_len = tuplen;
 	LogicalTapeReadExact(tape, tupbody, tupbodylen);
-	if (state->randomAccess)	/* need trailing length word? */
+	if (state->sortopt & TUPLESORT_RANDOMACCESS)	/* need trailing length word? */
 		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
 	stup->tuple = (void *) tuple;
 	/* set up first-column key value */
@@ -4087,7 +4088,7 @@ writetup_cluster(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
 	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
 	LogicalTapeWrite(tape, &tuple->t_self, sizeof(ItemPointerData));
 	LogicalTapeWrite(tape, tuple->t_data, tuple->t_len);
-	if (state->randomAccess)	/* need trailing length word? */
+	if (state->sortopt & TUPLESORT_RANDOMACCESS)	/* need trailing length word? */
 		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
 
 	if (!state->slabAllocatorUsed)
@@ -4113,7 +4114,7 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 	tuple->t_tableOid = InvalidOid;
 	/* Read in the tuple body */
 	LogicalTapeReadExact(tape, tuple->t_data, tuple->t_len);
-	if (state->randomAccess)	/* need trailing length word? */
+	if (state->sortopt & TUPLESORT_RANDOMACCESS)	/* need trailing length word? */
 		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
 	stup->tuple = (void *) tuple;
 	/* set up first-column key value, if it's a simple column */
@@ -4337,7 +4338,7 @@ writetup_index(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
 	tuplen = IndexTupleSize(tuple) + sizeof(tuplen);
 	LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen));
 	LogicalTapeWrite(tape, (void *) tuple, IndexTupleSize(tuple));
-	if (state->randomAccess)	/* need trailing length word? */
+	if (state->sortopt & TUPLESORT_RANDOMACCESS)	/* need trailing length word? */
 		LogicalTapeWrite(tape, (void *) &tuplen, sizeof(tuplen));
 
 	if (!state->slabAllocatorUsed)
@@ -4355,7 +4356,7 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tuplen);
 
 	LogicalTapeReadExact(tape, tuple, tuplen);
-	if (state->randomAccess)	/* need trailing length word? */
+	if (state->sortopt & TUPLESORT_RANDOMACCESS)	/* need trailing length word? */
 		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
 	stup->tuple = (void *) tuple;
 	/* set up first-column key value */
@@ -4425,7 +4426,7 @@ writetup_datum(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
 
 	LogicalTapeWrite(tape, (void *) &writtenlen, sizeof(writtenlen));
 	LogicalTapeWrite(tape, waddr, tuplen);
-	if (state->randomAccess)	/* need trailing length word? */
+	if (state->sortopt & TUPLESORT_RANDOMACCESS)	/* need trailing length word? */
 		LogicalTapeWrite(tape, (void *) &writtenlen, sizeof(writtenlen));
 
 	if (!state->slabAllocatorUsed && stup->tuple)
@@ -4465,7 +4466,7 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 		stup->tuple = raddr;
 	}
 
-	if (state->randomAccess)	/* need trailing length word? */
+	if (state->sortopt & TUPLESORT_RANDOMACCESS)	/* need trailing length word? */
 		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
 }
 
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index da5ba591989..345f4ce8024 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -86,6 +86,12 @@ typedef enum
 	SORT_SPACE_TYPE_MEMORY
 } TuplesortSpaceType;
 
+/* Bitwise option flags for tuple sorts */
+#define TUPLESORT_NONE					0
+
+/* specifies whether non-sequential access to the sort result is required */
+#define	TUPLESORT_RANDOMACCESS			(1 << 0)
+
 typedef struct TuplesortInstrumentation
 {
 	TuplesortMethod sortMethod; /* sort algorithm used */
@@ -201,32 +207,33 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 											Oid *sortOperators, Oid *sortCollations,
 											bool *nullsFirstFlags,
 											int workMem, SortCoordinate coordinate,
-											bool randomAccess);
+											int sortopt);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 											   Relation indexRel, int workMem,
-											   SortCoordinate coordinate, bool randomAccess);
+											   SortCoordinate coordinate,
+											   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
 												   int workMem, SortCoordinate coordinate,
-												   bool randomAccess);
+												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
 												  Relation indexRel,
 												  uint32 high_mask,
 												  uint32 low_mask,
 												  uint32 max_buckets,
 												  int workMem, SortCoordinate coordinate,
-												  bool randomAccess);
+												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 												  Relation indexRel,
 												  int workMem, SortCoordinate coordinate,
-												  bool randomAccess);
+												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
 											 int workMem, SortCoordinate coordinate,
-											 bool randomAccess);
+											 int sortopt);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
 extern bool tuplesort_used_bound(Tuplesortstate *state);
-- 
2.17.1

>From b07d6406fccc50ff59d4f48cbf81a55f7eed8ab6 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryz...@telsasoft.com>
Date: Fri, 1 Apr 2022 07:00:57 -0500
Subject: [PATCH 3/4] f!

---
 src/backend/utils/sort/tuplesort.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 5cc1328febf..a0ab61e8f9a 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1271,7 +1271,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 								false,	/* no unique check */
 								1,
 								workMem,
-								randomAccess,
+								sortopt & TUPLESORT_RANDOMACCESS,
 								PARALLEL_SORT(state));
 
 	state->comparetup = comparetup_datum;
-- 
2.17.1

>From 5a4904b1087795a6bb76ccb7d0ecf349dc10492f Mon Sep 17 00:00:00 2001
From: David Rowley <dgrow...@gmail.com>
Date: Fri, 1 Apr 2022 21:38:03 +1300
Subject: [PATCH 4/4] Use Generation memory contexts to store tuples in sorts

The general usage pattern when we store tuples in tuplesort.c is that we
store a series of tuples one by one and then either perform a sort or
spill them to disk.  In the common case there is no pfreeing of
already-stored tuples, so we have very little need for aset.c's memory
allocation behavior, which maintains freelists and always rounds
allocation sizes up to the next power of 2.

Here we conditionally use generation.c contexts for storing tuples in
tuplesort.c when the sort will never be bounded.  Unfortunately, the
memory context to store tuples in is already created by the time any
calls would be made to tuplesort_set_bound(), so here we add a new sort
option that allows callers to specify whether they're going to need a
bounded sort or not.  We keep using a standard aset.c allocator when this
sort option is set, since bounded sorts pfree tuples in arbitrary order.

Author: David Rowley
Discussion: https://postgr.es/m/caaphdvoh4aszsaoyhcxkuy01qf++8jj0paw+03dk+w25tqe...@mail.gmail.com
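
For illustration, a sketch of what an executor call site looks like once
both flags exist (mirroring nodeSort.c; "node", "plannode", "tupDesc" and
"tuplesortstate" are assumed to be the usual Sort node locals):

	int			tuplesortopts = TUPLESORT_NONE;

	/*
	 * Declare up front that tuplesort_set_bound() may be called later.
	 * Without TUPLESORT_ALLOWBOUNDED, tuplesort.c now picks a
	 * generation.c tuple context, which is unsuitable for the arbitrary
	 * pfree order of a bounded (top-N) sort.
	 */
	if (node->bounded)
		tuplesortopts |= TUPLESORT_ALLOWBOUNDED;
	if (node->randomAccess)
		tuplesortopts |= TUPLESORT_RANDOMACCESS;

	tuplesortstate = tuplesort_begin_heap(tupDesc, plannode->numCols,
										  plannode->sortColIdx,
										  plannode->sortOperators,
										  plannode->collations,
										  plannode->nullsFirst,
										  work_mem, NULL, tuplesortopts);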
---
 src/backend/executor/nodeIncrementalSort.c |  6 ++++--
 src/backend/executor/nodeSort.c            |  2 ++
 src/backend/utils/sort/tuplesort.c         | 20 ++++++++++++++++----
 src/include/utils/tuplesort.h              |  3 +++
 4 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
index 4f50bc845da..3c5e8c9a969 100644
--- a/src/backend/executor/nodeIncrementalSort.c
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -315,7 +315,7 @@ switchToPresortedPrefixMode(PlanState *pstate)
 												&(plannode->sort.nullsFirst[nPresortedCols]),
 												work_mem,
 												NULL,
-												TUPLESORT_NONE);
+												node->bounded ? TUPLESORT_ALLOWBOUNDED : TUPLESORT_NONE);
 		node->prefixsort_state = prefixsort_state;
 	}
 	else
@@ -616,7 +616,9 @@ ExecIncrementalSort(PlanState *pstate)
 												  plannode->sort.nullsFirst,
 												  work_mem,
 												  NULL,
-												  TUPLESORT_NONE);
+												  node->bounded ?
+														TUPLESORT_ALLOWBOUNDED :
+														TUPLESORT_NONE);
 			node->fullsort_state = fullsort_state;
 		}
 		else
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index a113d737955..3c28d60c3ef 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -99,6 +99,8 @@ ExecSort(PlanState *pstate)
 
 		if (node->randomAccess)
 			tuplesortopts |= TUPLESORT_RANDOMACCESS;
+		if (node->bounded)
+			tuplesortopts |= TUPLESORT_ALLOWBOUNDED;
 
 		if (node->datumSort)
 			tuplesortstate = tuplesort_begin_datum(TupleDescAttr(tupDesc, 0)->atttypid,
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index a0ab61e8f9a..613f55ca21b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -842,11 +842,21 @@ tuplesort_begin_batch(Tuplesortstate *state)
 	 * eases memory management.  Resetting at key points reduces
 	 * fragmentation. Note that the memtuples array of SortTuples is allocated
 	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * free memtuples early.  For bounded sorts, tuples may be pfreed in any
+	 * order, so we use a regular aset.c context so that it can make use of
+	 * freed memory.  When the sort is not bounded, we make use of a
+	 * generation.c context as this keeps allocations more compact with less
+	 * wastage.  Allocations are also slightly more CPU efficient.
 	 */
-	state->tuplecontext = AllocSetContextCreate(state->sortcontext,
-												"Caller tuples",
-												ALLOCSET_DEFAULT_SIZES);
+	if (state->sortopt & TUPLESORT_ALLOWBOUNDED)
+		state->tuplecontext = AllocSetContextCreate(state->sortcontext,
+													"Caller tuples",
+													ALLOCSET_DEFAULT_SIZES);
+	else
+		state->tuplecontext = GenerationContextCreate(state->sortcontext,
+													  "Caller tuples",
+													  ALLOCSET_DEFAULT_SIZES);
+
 
 	state->status = TSS_INITIAL;
 	state->bounded = false;
@@ -1337,6 +1347,8 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 {
 	/* Assert we're called before loading any tuples */
 	Assert(state->status == TSS_INITIAL && state->memtupcount == 0);
+	/* Assert we allow bounded sorts */
+	Assert(state->sortopt & TUPLESORT_ALLOWBOUNDED);
 	/* Can't set the bound twice, either */
 	Assert(!state->bounded);
 	/* Also, this shouldn't be called in a parallel worker */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 345f4ce8024..364cf132fcb 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -92,6 +92,9 @@ typedef enum
 /* specifies whether non-sequential access to the sort result is required */
 #define	TUPLESORT_RANDOMACCESS			(1 << 0)
 
+/* specifies if the tuplesort is able to support bounded sorts */
+#define TUPLESORT_ALLOWBOUNDED			(1 << 1)
+
 typedef struct TuplesortInstrumentation
 {
 	TuplesortMethod sortMethod; /* sort algorithm used */
-- 
2.17.1
