On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:
> 
> And if there a way to pre-allocate larger chunks? Presumably we could
> assign the blocks to tape in larger chunks (e.g. 128kB, i.e. 16 x
> 8kB)
> instead of just single block. I haven't seen anything like that in
> tape.c, though ...

It turned out to be simple (at least a POC) so I threw together a
patch. I just added a 32-element array of block numbers to each tape.
When we need a new block, we retrieve a block number from that array;
or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times.

I reproduced the problem on a smaller scale (330M groups, ~30GB of
memory on a 16GB box). Work_mem=64MB. The query is a simple distinct.

Unpatched master:
   Sort: 250s
   HashAgg: 310s
Patched master:
   Sort: 245s
   HashAgg: 262s

That's a nice improvement for such a simple patch. We can tweak the
number of blocks to preallocate, or do other things like double from a
small number up to a maximum. Also, a proper patch would probably
release the blocks back as free when the tape was rewound.

As long as the number of block numbers to preallocate is not too large,
I don't think we need to change the API. It seems fine for sort to do
the same thing, even though there's not any benefit.

Regards,
        Jeff Davis


diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index bc8d56807e2..f44594191e1 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -110,6 +110,7 @@ typedef struct TapeBlockTrailer
 #define TapeBlockSetNBytes(buf, nbytes) \
 	(TapeBlockGetTrailer(buf)->next = -(nbytes))
 
+#define TAPE_WRITE_N_PREALLOC 32
 
 /*
  * This data structure represents a single "logical tape" within the set
@@ -151,6 +152,15 @@ typedef struct LogicalTape
 	int			max_size;		/* highest useful, safe buffer_size */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
+
+	/*
+	 * When multiple tapes are being written to concurrently (as in HashAgg),
+	 * it's imporant to preallocate some block numbers to make writes more
+	 * sequential. This doesn't preallocate the blocks themselves; only the
+	 * block numbers.
+	 */
+	int			nprealloc;
+	long		prealloc[TAPE_WRITE_N_PREALLOC];
 } LogicalTape;
 
 /*
@@ -557,6 +567,7 @@ ltsInitTape(LogicalTape *lt)
 	lt->max_size = MaxAllocSize;
 	lt->pos = 0;
 	lt->nbytes = 0;
+	lt->nprealloc = 0;
 }
 
 /*
@@ -733,7 +744,13 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 			 * First allocate the next block, so that we can store it in the
 			 * 'next' pointer of this block.
 			 */
-			nextBlockNumber = ltsGetFreeBlock(lts);
+			if (lt->nprealloc == 0)
+			{
+				lt->nprealloc = TAPE_WRITE_N_PREALLOC;
+				for (int i = lt->nprealloc; i > 0; i--)
+					lt->prealloc[i - 1] = ltsGetFreeBlock(lts);
+			}
+			nextBlockNumber = lt->prealloc[--lt->nprealloc];
 
 			/* set the next-pointer and dump the current block. */
 			TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber;

Reply via email to