On Tue, Sep 8, 2020 at 11:28 PM Jeff Davis <pg...@j-davis.com> wrote: > Preallocation showed significant gains for HashAgg, and BufFile doesn't > support sparse writes. So, for HashAgg, it seems like we should just > update the comment and consider it the price of using BufFile.
> For Sort, we can just disable preallocation. +1. I think that you can push sort-no-prealloc.patch without delay. That looks good to me. > Right now, it seems nBlocksAllocated means "number of blocks returned > by ltsGetFreeBlock(), plus nHoleBlocks". Just to be clear: I'm assuming that we must honor the original intent of earlier code in my remarks here (in particular, code added by 7ac4a389a7d). This may not be exactly where we end up, but it's a good starting point. nHoleBlocks was added by the parallel CREATE INDEX commit in 2018, which added BufFile/logtape.c concatenation. Whereas nBlocksAllocated was added by the earlier bugfix commit 7ac4a389a7d in 2017 (the bugfix I've referenced several times upthread). Clearly nBlocksAllocated cannot be defined in terms of some other thing that wasn't there when it was first added (I refer to nHoleBlocks). In general nHoleBlocks can only be non-zero when the logical tapeset has been unified in a leader process (for a parallel CREATE INDEX). nBlocksAllocated is not the same thing as nBlocksWritten, though the difference is more subtle than you suggest. nBlocksAllocated actually means (or should mean) "the number of blocks allocated to the file", which is usually the same thing that a stat() call or tools like "ls" are expected report for the underlying temp file once the merge phase of an external sort is reached (assuming that you only need one temp file for a BufFile, and didn't use parallelism/concatenation, which is the common case). That's why nBlocksAllocated is what LogicalTapeSetBlocks() returns (pretty much). At least, the original post-7ac4a389a7d version of LogicalTapeSetBlocks() was just "return nBlocksAllocated;". nHoleBlocks was added for parallel CI, but that was only supposed to compensate for the holes left behind by concatenation/parallel sort, without changing the logtape.c space management design in any fundamental way. Obviously you must be wondering what the difference is, if it's not just the nHoleBlocks thing. nBlocksAllocated is not necessarily equal to nBlocksWritten (even when we ignore concatenation/nHoleBlocks), but it's almost always equal in practice (again, barring nHoleBlocks != 0). It's possible that a tuplesort will not have flushed the last block at a point when LogicalTapeSetBlocks() is called -- it will have allocated the block, but not yet written it to the BufFile. IOW, as far as tuplesort.c is concerned the data is written to tape, but it happens to not have been passed through to the OS via write(), or even passed through to BufFileWrite() -- it happens to still be in one of the small per-tape write buffers. When this occurs, a small amount of dirty data in said per-tape buffer is considered written by tuplesort.c, but from the point of view of logtape.c it is allocated but not yet "written" (by which I mean not yet passed to buffile.c, which actually does its own buffering, which it can neglect to flush immediately in turn). It's possible that tuplesort.c will need to call LogicalTapeSetBlocks() at an earlier point after all tuples are written but before they're "flushed" in logtape.c/buffile.c. We need to avoid confusion when that happens. We want to insulate tuplesort.c from implementation details that are private to logtape.c and/or buffile.c. Bear in mind that nBlocksAllocated was originally only ever supposed to have a value equal to nBlocksWritten, or the value nBlocksWritten + 1. It is reasonable to want to hide the buffering from LogicalTapeSetBlocks() once you realize that this mechanism is only supposed to smooth-over an edge case involving one extra block that will be written out in a moment anyway. What does all of this mean for the new preallocation stuff that benefits HashAggs that spill? Well, I'm not sure. I was specifically concerned that somebody would end up misusing the ltsWriteBlock() allocated-but-not-written thing in this way back in 2017, and said so at the time -- that's why commit 7ac4a389a7d added comments about all this to ltsWriteBlock(). For external sorts, that we're agreed won't be using preallocation anyway, I think that we should go back to reporting allocated blocks from LogicalTapeSetBlocks() -- very often this is nBlocksWritten, but occasionally it's nBlocksWritten + 1. I haven't yet refreshed my memory on the exact details of when you get one behavior rather than the other, but I know it is possible in practice with a tuplesort on Postgres 12. It might depend on subtle issues like the alignment with BufFile segments -- see my test case from 2017 to get an idea of how to make it easier to reveal problems in this area: https://www.postgresql.org/message-id/cam3swzrwdntkhig0gyix_1muaypik3dv6-6542pye2iel-f...@mail.gmail.com We still need to put the reliance on ltsWriteBlock() allocating many blocks before they've been logically written on some kind of formal footing for Postgres 13 -- it is now possible that an all-zero block will be left behind even after we're done writing and have flushed all temp buffers, which is a new thing. In cases when the zero-block-written thing happened on Postgres 12, we would later flush out a block that overwrote every zero block -- that happened reliably. ltsWriteBlock()'s loop only "preallocated" blocks it *knew* would get filled with real data shortly afterwards, as an implementation expedient -- not as an optimization. This is no longer the case. At a minimum, we need to update the old ltsWriteBlock() allocated-but-not-written comments to acknowledge that the HashAgg case exists and has different concerns. We must also determine whether we have the same issue with written-but-not-yet-flushed data for the new nodeAgg.c caller. You're not doing the ltsWriteBlock() loop-that-writes-zero-blocks thing because you have an unflushed buffer from another tape -- you're doing it to preallocate and avoid possible fragmentation. I'm mostly okay with doing the preallocation that way, but that needs to be reconciled with the original design. And the original design needs to continue to do the same things for tuplesort.c, and maybe nodeAgg.c, too. I think that the return value of LogicalTapeSetBlocks() should be at least nBlocksWritten, while also including blocks that we know that flushing dirty buffered data out will write in a moment, too (note that I'm still pretending nHoleBlocks doesn't exist because it's not important in my remarks here). IOW, it ought to include preallocated blocks (for HashAgg), while not failing to count one extra block that happens to still be buffered but is written as far as the logtape.c caller is concerned (certainly for tuplesort caller, and maybe for HashAgg caller too). > nBlocksWritten seems to mean "the logical size of the BufFile". The > BufFile can have holes in it after concatenation, but from the > perspective of logtape.c, nBlocksWritten seems like a better fit for > instrumentation purposes. So I'd be inclined to return (nBlocksWritten > - nHoleBlocks). > > The only thing I can think of that would be better is if BufFile > tracked for itself the logical vs. physical size, which might be a good > improvement to make (and would mean that logtape.c wouldn't be > responsible for tracking the holes itself). I don't really think that that's workable, for what it's worth. The "holes" left behind by concatenation (and counted by nHoleBlocks) are ranges that logtape.c can never reuse that are "between" worker tapes. They are necessary because logtape.c needs to be able to read back block number metadata from worker temp files (it makes sense of them by applying an offset). ISTM that the logical vs physical size distinction will have to be tracked by logtape.c for as long as it buffers data for writes. It's the natural way to do it IMV. -- Peter Geoghegan