On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote: > > And if there a way to pre-allocate larger chunks? Presumably we could > assign the blocks to tape in larger chunks (e.g. 128kB, i.e. 16 x > 8kB) > instead of just single block. I haven't seen anything like that in > tape.c, though ...
It turned out to be simple (at least a POC) so I threw together a patch. I just added a 32-element array of block numbers to each tape. When we need a new block, we retrieve a block number from that array; or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times. I reproduced the problem on a smaller scale (330M groups, ~30GB of memory on a 16GB box). Work_mem=64MB. The query is a simple distinct. Unpatched master: Sort: 250s HashAgg: 310s Patched master: Sort: 245s HashAgg: 262s That's a nice improvement for such a simple patch. We can tweak the number of blocks to preallocate, or do other things like double from a small number up to a maximum. Also, a proper patch would probably release the blocks back as free when the tape was rewound. As long as the number of block numbers to preallocate is not too large, I don't think we need to change the API. It seems fine for sort to do the same thing, even though there's not any benefit. Regards, Jeff Davis
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c index bc8d56807e2..f44594191e1 100644 --- a/src/backend/utils/sort/logtape.c +++ b/src/backend/utils/sort/logtape.c @@ -110,6 +110,7 @@ typedef struct TapeBlockTrailer #define TapeBlockSetNBytes(buf, nbytes) \ (TapeBlockGetTrailer(buf)->next = -(nbytes)) +#define TAPE_WRITE_N_PREALLOC 32 /* * This data structure represents a single "logical tape" within the set @@ -151,6 +152,15 @@ typedef struct LogicalTape int max_size; /* highest useful, safe buffer_size */ int pos; /* next read/write position in buffer */ int nbytes; /* total # of valid bytes in buffer */ + + /* + * When multiple tapes are being written to concurrently (as in HashAgg), + * it's imporant to preallocate some block numbers to make writes more + * sequential. This doesn't preallocate the blocks themselves; only the + * block numbers. + */ + int nprealloc; + long prealloc[TAPE_WRITE_N_PREALLOC]; } LogicalTape; /* @@ -557,6 +567,7 @@ ltsInitTape(LogicalTape *lt) lt->max_size = MaxAllocSize; lt->pos = 0; lt->nbytes = 0; + lt->nprealloc = 0; } /* @@ -733,7 +744,13 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum, * First allocate the next block, so that we can store it in the * 'next' pointer of this block. */ - nextBlockNumber = ltsGetFreeBlock(lts); + if (lt->nprealloc == 0) + { + lt->nprealloc = TAPE_WRITE_N_PREALLOC; + for (int i = lt->nprealloc; i > 0; i--) + lt->prealloc[i - 1] = ltsGetFreeBlock(lts); + } + nextBlockNumber = lt->prealloc[--lt->nprealloc]; /* set the next-pointer and dump the current block. */ TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber;