Apologies for the delayed response to this.

On Tue, Oct 4, 2016 at 3:47 AM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> One of the patches in Peter Geoghegan's Parallel tuplesort patch set
> [1] is to put a cap on the number of tapes to use for the sort.
> That amounts to about 8% of the available memory. That's quite
> wasteful. Peter's approach of putting an arbitrary cap on the max
> number of tapes works, but I find it a bit hackish. And you still
> waste that 8% with smaller work_mem settings.

I don't think it's hackish -- there was never a theoretical
justification, from Knuth or anyone else, for using as many tapes as
possible. I think that Simon's 2006 work on allowing for a number of
tapes much greater than Knuth's "sweet spot" of 7 was useful only
because it sometimes enabled a final on-the-fly merge in cases where we
are not merging a huge number of runs (i.e., when merging only somewhat
more than 7 runs). Your recent work on tape preloading has probably
greatly diminished the value of guaranteeing that all merging happens
in a single, final on-the-fly merge, though. Knuth's sweet spot of 7
had little or nothing to do with the economics of buying many tape
drives.

You shouldn't really waste 8% of the budget with low work_mem settings
with my cap patch applied, because the new cap never limits the number
of tapes at the low end. IIRC, the cap of 500 tapes doesn't start to
matter until you have about 1GB of work_mem. So, if there is still any
waste at the low end, that can only be solved by tweaking the main
calculation within tuplesort_merge_order() (sketched below). (Also, to
be clear for others following along: that memory is never actually
allocated, so it's only "wasted" from the perspective of one sort
operation alone.)

The cost of multiple passes is paid in sequential I/O of tape temp
files, which is now clearly a relatively low cost. OTOH, the cost of a
larger merge heap is paid in more CPU cache misses, which is a cost we
can feel quite severely. While it's really hard to come up with a
generic break-even point, I do think that there is value in capping the
number of tapes somewhere in the hundreds. A cap on the number of tapes
along the lines I've proposed was thought about from day one, by both
Tom and Simon. Noah's relatively recent MaxAllocSize work has given the
issue new importance, though (the same might have been said during the
9.6 replacement selection vs. quicksort discussions, actually).

> When we finish writing an initial run to a tape, we keep the tape
> buffers around. That's the reason we need to reserve that memory. But
> there's no fundamental reason we have to do that. As I suggested in
> [2], we could flush the buffers to disk, and only keep in memory the
> location of the last block we wrote. If we need to write another run
> to the tape, we can reload the last incomplete block from the disk,
> and continue writing.

Okay. But you haven't actually addressed the problem of non-trivial
amounts of memory being logically allocated ahead of time for no good
reason -- you've only addressed the constant factor (the overhead
per-tape). Can't we also fix the general problem, by applying a cap?
Better to have a cap that is approximately right (in the hundreds or
so) than one that is exactly wrong (infinity -- no cap).

> Reloading the last block requires an extra I/O. That's OK. It's quite
> rare to have a multi-pass sort these days, so it's a good bet that you
> don't need to do it. And if you have a very small work_mem, so that
> you need to do a multi-pass sort, having some more memory available
> for building the initial runs is probably worth it, and there's also a
> good chance that the block is still in the OS cache.

That analysis does seem sound to me.
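
To make sure we're picturing the same scheme, here's a rough sketch of
the flush/reload idea as I understand it. The function names
ltsFlushLastBlock()/ltsReloadLastBlock() are invented for illustration
only; ltsWriteBlock()/ltsReadBlock() are the existing logtape.c block
I/O routines, and the struct fields shown are a simplified subset of
what logtape.c's per-tape state would need:

/*
 * Sketch only -- not actual logtape.c code.
 */
typedef struct LogicalTape
{
    long        curBlockNumber; /* block# of last (possibly partial) block */
    int         nbytes;         /* valid bytes within that block */
    char       *buffer;         /* BLCKSZ buffer, or NULL once flushed */
} LogicalTape;

/* initial run finished: write out the final, possibly incomplete block */
static void
ltsFlushLastBlock(LogicalTapeSet *lts, LogicalTape *lt)
{
    ltsWriteBlock(lts, lt->curBlockNumber, lt->buffer);
    pfree(lt->buffer);          /* memory goes back to run building */
    lt->buffer = NULL;
}

/* another run must be appended to this tape: costs one extra read */
static void
ltsReloadLastBlock(LogicalTapeSet *lts, LogicalTape *lt)
{
    lt->buffer = palloc(BLCKSZ);
    ltsReadBlock(lts, lt->curBlockNumber, lt->buffer);
    /* writing resumes at offset lt->nbytes within the block */
}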
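
And, for concreteness, since I mentioned tuplesort_merge_order() above:
the cap patch amounts to clamping the result of the existing
calculation there. Roughly this (MINORDER, TAPE_BUFFER_OVERHEAD, and
MERGE_BUFFER_SIZE are the existing tuplesort.c constants; only MAXORDER
is new):

#define MAXORDER        500     /* new: maximum merge order */

static int
tuplesort_merge_order(int64 allowedMem)
{
    int         mOrder;

    /* one tape per merge input, plus another one for the output */
    mOrder = (allowedMem - TAPE_BUFFER_OVERHEAD) /
        (MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);

    /* even in minimum memory, use at least a MINORDER merge */
    mOrder = Max(mOrder, MINORDER);
    /* new: even with huge work_mem, never exceed the cap */
    mOrder = Min(mOrder, MAXORDER);

    return mOrder;
}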
> In addition to saving a little bit of memory, I'd like to do this
> refactoring because it simplifies the code. It's code that has stayed
> largely unchanged for the past 15 years, so I'm not too eager to
> touch it, but looking at the changes coming with Peter's parallel
> tuplesort patch set, I think this makes sense.

I can definitely see value in refactoring, to make that code less
complicated -- I just don't think it's justified by the amount of
memory that is wasted on tapes.

That said, I'm not especially worried about the complexity of the
logtape.c changes in the parallel CREATE INDEX patch alone. I'm much
more interested in having a logtape.c that could more easily be made to
support binary searching, etc., to find *partition* boundaries, which
my CREATE INDEX patch doesn't use or care about. That is needed for
tuplesort.c partition-based sorting. When parallel sort isn't just used
by CREATE INDEX, partitioning becomes important. And, with
partitioning, dynamic sampling is essential (I think that this will end
up living in logtape.c).

To recap on what I went into in the partitioning-to-parallel-tuplesort
thread [1]: I think that partitioning will come in a future release,
and will be of more value to parallel queries, where much more can be
pushed down within the executor; we really want to remove
co-dependencies across workers early. I'm much less excited about how
much partitioning will benefit parallel CREATE INDEX, though, where a
partition-based sort would need to deal with the complexity of
*actually writing* the finished index in parallel (at least, to see any
benefit), fixing up internal pages, and so on. And any benefit would be
limited, because the writing itself is I/O bound -- there is never that
much work to push down, unlike with real parallel query. (I feel fairly
confident about this because my CREATE INDEX patch already seems to
give Postgres parallel CREATE INDEX scalability comparable to the
implementations in other major systems.)

[1] https://www.postgresql.org/message-id/CAM3SWZR%2BATYAzyMT%2Bhm-Bo%3D1L1smtJbNDtibwBTKtYqS0dYZVg%40mail.gmail.com
--
Peter Geoghegan