On Tue, Aug 4, 2015 at 1:24 AM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> Yeah, I don't think there's a big performance difference between the two
> approaches. I'm not wedded to either approach. Whichever approach we use,
> my main point was that it would be better to handle this in the logical
> tape abstraction. In my approach, you would have those "in-memory tails"
> attached to the last two tapes. In Peter's approach, you would have one
> tape that's completely in memory, backed by the array. In either case,
> the tapes would look like normal tapes to most of tuplesort.c. There
> would be no extra TSS state, it would be TSS_SORTEDONTAPE or
> TSS_FINALMERGE as usual.
TBH, it's not clear to me why making the logical tape abstraction manage some fraction of memtuples has any advantage at all. The only case that I can imagine where this could be useful is where a logical tape does asynchronous I/O, and has the benefit of some portion of memtuples (probably half) as scratch space (most I/O probably occurs while tuplesort quicksorts the other half of memtuples). But that has nothing to do with building a better abstraction.

The more expansive patch that I'm currently working on, which covers every external sorting case -- the patch to *also* quicksort runs past the first run, regardless of whether or not we can get away with a "quicksort with spillover" -- is going very well. I haven't had a solid block of time to work on it this week due to other commitments, but it isn't very complicated. Performance benefits are considerable even without saving any I/O. It can be as much as twice as fast for "merge sort" sorts in some cases. So not quite as nice as "quicksort with spillover", but still a significant improvement, considering that writing everything out is inevitable for the cases helped.

As I said before, the upcoming patch has tuplesort give up on memtuples *ever* being a heap after the first run, whatever happens. I just quicksort and dump in batches past the first run. Since I give up on replacement selection only after the first run, I still have most of the upside of replacement selection, but little of the big downside of heap maintenance. This will result in smaller runs on average past the first run. I can give you at least 3 very strong arguments for why this is totally worth it in every case, but I'll wait till I'm asked for them. :-)

One useful saving made in this upcoming multi-tape-run patch is that it never treats any non-current run as part of the heap beyond its run number, even when currentRun is the first (run 0).
So no comparisons occur beyond the first run to maintain the heap invariant *even when the first run is current* -- tuples that belong to the second run are simply appended (we only do an extra comparison to determine that that's the run they belong in). So the second run (run 1) is not trusted to be heapified by dumptuples(), and is quicksorted (either by "quicksort with spillover", along with much of the first run, or on its own, when there are multiple conventional on-tape runs; it doesn't matter which way it is quicksorted). From there on, every run is quicksorted when memtuples fills, and written out entirely in memtuples-sized batches.

> The logical tape abstraction is currently too low-level for that. It's
> just a tape of bytes, and tuplesort.c determines where a tuple begins
> and ends. That needs to be changed so that the logical tape abstraction
> works tuple-at-a-time instead. For example, instead of LogicalTapeRead(N)
> which reads N bytes, you would have LogicalTapeReadTuple(), which reads
> the next tuple, and returns its length and the tuple itself. But that
> would be quite sensible anyway.

Why would it be sensible? I honestly wonder why you want to do things that way. What is the advantage of not having what I call the in-memory "subrun" managed by a logical tape?

It's already nothing like any other type of run in several ways. Aside from being all in-memory, it is often much larger. It's special in that it kind of rejects the preliminary determination that some tuples within memtuples need to be in a second, traditional, on-tape run (because we can just quicksort everything and merge with the existing single on-tape run). Also, we now need tuplesort_gettuple_common() to ask READTUP() what to tell its caller about freeing memory that is allocated within tuplesort.c directly.

The memtuples array is already treated as an array, a heap, the head of each tape that is merged, and maybe one other thing that I forgot about offhand.
The field SortTuple.tupindex has a total of 3 context-dependent meanings. Playing these kinds of games with the memtuples array is very much something that happens already.

More than anything else, I think that the new TSS_MEMTAPEMERGE state is justified as a special case because "quicksort with spillover" is legitimately a special case. Users will want to know how close they were to an internal sort when looking at EXPLAIN ANALYZE and so on. When cost_sort() is fixed to be a continuous function (which I think will be pretty nice for certain other problems), the reader of that code will want to know more about this "quicksort with spillover" special case that can save 99% of I/O for what is still classified as an external sort -- they will look in tuplesort.c for it, not the logical tape code.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers