On 01/07/2017 12:33 AM, Peter Geoghegan wrote:
Previously, all calls to ltsGetFreeBlock() were immediately followed
by a corresponding call to ltsWriteBlock(); we wrote out the
newly-allocated-in-first-pass block there and then. It's a good idea
for a corresponding ltsWriteBlock() call to happen immediately, and
it's *essential* for it to happen before any later block is written
during the first write pass (when tuples are initially dumped out to
form runs), since, as noted above ltsWriteBlock():

 * Write a block-sized buffer to the specified block of the underlying file.
 * NB: should not attempt to write beyond current end of file (ie, create
 * "holes" in file), since BufFile doesn't allow that.  The first write pass
 * must write blocks sequentially.

I completely missed that rule :-(.

However, a "write beyond current end of file" now seems to actually be
attempted at times, resulting in an arcane error during sorting. This
is more or less because LogicalTapeWrite() doesn't heed the warning
from ltsGetFreeBlock(), as seen here:

    while (size > 0)
        if (lt->pos >= TapeBlockPayloadSize)

             * First allocate the next block, so that we can store it in the
             * 'next' pointer of this block.
            nextBlockNumber = ltsGetFreeBlock(lts);

            /* set the next-pointer and dump the current block. */
            TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber;
            ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);


Note that LogicalTapeWrite() now allocates each block (calls
ltsGetFreeBlock()) one block in advance of current block, immediately
before *current* block is written (not the next block/block just
allocated). So, the corresponding ltsWriteBlock() call *must* happen
at an unspecified time in the future, typically the next time control
ends up at exactly the same point, where the block that was "next"
becomes "current".

I'm not about to argue that we should go back to following this
ltsGetFreeBlock() rule, though; I can see why Heikki refactored
LogicalTapeWrite() to allocate blocks a block in advance. Still, we
need to be more careful in avoiding the underlying problem that the
ltsGetFreeBlock() rule was intended to prevent. Attached patch 0001-*
has logtape.c be sure to write out a tape's buffer every time
tuplesort ends a run.

Hmm. The LogicalTapeEndWriting() function is very similar to the LogicalTapePause() function I had in the pause/resume patch. They both flush the last block to disk. The difference is that LogicalTapePause() also free'd the buffer, and read back the last block from the disk if you continued writing, while LogicalTapeEndWriting() keeps a copy of it in memory.

With the proposed fix (or with the pause/resume), you can only write to a single tape at a time. Not a problem at the moment, but something to consider. At least, would need more comments to make that more clear, and an Assert would be nice.

Alternatively, we could fix this with a small change in ltsWriteBlock(), see attached patch. When we're about to create a hole in the file, write all-zero blocks to avoid the creating hole, before the block itself. That's not quite as efficient as writing the actual block contents into the hole, which avoids having to write it later, but probably won't make any measurable difference in practice. I'm inclined to do that, because it relaxes the rules on what you're allowed to do, in what order, which makes this more robust in general. We coul *also* have something like LogicalTapeEndWriting() or LogicalTapePause(), for efficiency, but it doesn't seem that important.

I have a test case.
> ...

Thanks for the analysis!

- Heikki

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 83e9424537..4a64c53af5 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -152,7 +152,17 @@ typedef struct LogicalTape
 struct LogicalTapeSet
        BufFile    *pfile;                      /* underlying file for whole 
tape set */
-       long            nFileBlocks;    /* # of blocks used in underlying file 
+       /*
+        * File size tracking. nBlocksWritten is the size of the underlying 
+        * in BLCKSZ blocks. nBlocksReserved is the number of blocks allocated 
+        * ltsGetFreeBlock(), and it is always greater than or equal to
+        * nBlocksWritten. Blocks between nBlocksReserved and nBlocksWritten are
+        * blocks that have been allocated for a tape, but have not been written
+        * to the underlying file yet.
+        */
+       long            nBlocksReserved;        /* # of blocks allocated by 
ltsGetFreeBlock */
+       long            nBlocksWritten;         /* # of blocks used in 
underlying file */
         * We store the numbers of recycled-and-available blocks in 
@@ -187,21 +197,36 @@ static void ltsReleaseBlock(LogicalTapeSet *lts, long 
  * Write a block-sized buffer to the specified block of the underlying file.
- * NB: should not attempt to write beyond current end of file (ie, create
- * "holes" in file), since BufFile doesn't allow that.  The first write pass
- * must write blocks sequentially.
- *
  * No need for an error return convention; we ereport() on any error.
 static void
 ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
+       /*
+        * BufFile does not support "holes", so if we're about to write a block
+        * that's past the current end of file, fill the space between the
+        * current end of file and the target block with zeros.
+        */
+       while (blocknum > lts->nBlocksWritten)
+       {
+               char            zerobuf[BLCKSZ];
+               MemSet(zerobuf, 0, sizeof(zerobuf));
+               ltsWriteBlock(lts, lts->nBlocksWritten, zerobuf);
+       }
+       /* Write the requested block */
        if (BufFileSeekBlock(lts->pfile, blocknum) != 0 ||
                BufFileWrite(lts->pfile, buffer, BLCKSZ) != BLCKSZ)
                                 errmsg("could not write block %ld of temporary 
file: %m",
+       /* Update nBlocksWritten, if we extended the file */
+       if (blocknum == lts->nBlocksWritten)
+               lts->nBlocksWritten++;
@@ -281,9 +306,6 @@ freeBlocks_cmp(const void *a, const void *b)
  * Select a currently unused block for writing to.
- *
- * NB: should only be called when writer is ready to write immediately,
- * to ensure that first write pass is sequential.
 static long
 ltsGetFreeBlock(LogicalTapeSet *lts)
@@ -304,7 +326,7 @@ ltsGetFreeBlock(LogicalTapeSet *lts)
                return lts->freeBlocks[--lts->nFreeBlocks];
-               return lts->nFileBlocks++;
+               return lts->nBlocksReserved++;
@@ -360,7 +382,8 @@ LogicalTapeSetCreate(int ntapes)
        lts = (LogicalTapeSet *) palloc(offsetof(LogicalTapeSet, tapes) +
* sizeof(LogicalTape));
        lts->pfile = BufFileCreateTemp(false);
-       lts->nFileBlocks = 0L;
+       lts->nBlocksReserved = 0L;
+       lts->nBlocksWritten = 0L;
        lts->forgetFreeSpace = false;
        lts->blocksSorted = true;       /* a zero-length array is sorted ... */
        lts->freeBlocksLen = 32;        /* reasonable initial guess */
@@ -858,5 +881,5 @@ LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 LogicalTapeSetBlocks(LogicalTapeSet *lts)
-       return lts->nFileBlocks;
+       return lts->nBlocksReserved;
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to