On Tue, Sep 1, 2020 at 5:24 PM Peter Geoghegan <p...@bowt.ie> wrote: > On Tue, Sep 1, 2020 at 4:36 PM Alvaro Herrera <alvhe...@2ndquadrant.com> > wrote: > > This open item hasn't received any replies. I think Peter knows how to > > fix it already, but no patch has been posted ... It'd be good to get a > > move on it. > > I picked this up again today.
One easy way to get logtape.c to behave in the same way as Postgres 12 for a multi-pass external sort (i.e. to use fewer blocks and to report the number of blocks used accurately) is to #define TAPE_WRITE_PREALLOC_MIN and TAPE_WRITE_PREALLOC_MAX to 1. So it looks like the problem is in the preallocation stuff added by commit 896ddf9b3cd, and not the new heap-based free list logic added by commit c02fdc92230. That's good news, because it means that the problem may be fairly well isolated -- commit 896ddf9b3cd was a pretty small and isolated thing. The comments in ltsWriteBlock() added by the 2017 bugfix commit 7ac4a389a7d clearly say that the zero block writing stuff is only supposed to happen at the edge of a tape boundary, which ought to be rare -- see the comment block in ltsWriteBlock(). And yet the new preallocation stuff explicitly relies on that it writing zero blocks much more frequently. I'm concerned that that can result in increased and unnecessary I/O, especially for sorts, but also for hash aggs that spill. I'm also concerned that having preallocated-but-allocated blocks confuses the accounting used by trace_sort/LogicalTapeSetBlocks(). Separately, it's possible to make the trace_sort/LogicalTapeSetBlocks() instrumentation agree with the filesystem by replacing the use of nBlocksAllocated within LogicalTapeSetBlocks() with nBlocksWritten -- that seems to make the instrumentation correct without changing the current behavior at all. But I'm not ready to endorse that approach, since it's not quite clear what nBlocksAllocated and nBlocksWritten mean right now -- those two fields were both added by the aforementioned 2017 bugfix commit, which introduced the "allocated vs written" distinction in the first place. We should totally disable the preallocation stuff for external sorts in any case. External sorts are naturally characterized by relatively large, distinct batching of reads and writes -- preallocation cannot help. -- Peter Geoghegan