On Wed, Aug 3, 2016 at 2:13 PM, Peter Geoghegan <p...@heroku.com> wrote: > Since merging is a big bottleneck with this, we should probably also > work to address that indirectly.

## Advertising

I attach a patch that changes how we maintain the heap invariant during tuplesort merging. I already mentioned this over on the "Parallel tuplesort, partitioning, merging, and the future" thread. As noted already on that thread, this patch makes merging clustered numeric input about 2.1x faster overall in one case, which is particularly useful in the context of a serial final/leader merge during a parallel CREATE INDEX. Even *random* non-C-collated text input is made significantly faster. This work is totally orthogonal to parallelism, though; it's just very timely, given our discussion of the merge bottleneck on this thread. If I benchmark a parallel build of a 100 million row index, with presorted input, I can see a 71% reduction in *comparisons* with 8 tapes/workers, and an 80% reduction in comparisons with 16 workers/tapes in one instance (the numeric case I just mentioned). With random input, we can still come out significantly ahead, but not to the same degree. I was able to see a reduction in comparisons during a leader merge, from 1,468,522,397 comparisons to 999,755,569 comparisons, which is obviously still quite significant (worker merges, if any, benefit too). I think I need to redo my parallel CREATE INDEX benchmark, so that you can take this into account. Also, I think that this patch will make very large external sorts that naturally have tens of runs to merge significantly faster, but I didn't bother to benchmark that. The patch is intended to be applied on top of parallel B-Tree patches 0001-* and 0002-* [1]. I happened to test it with parallelism, but these are all independently useful, and will be entered as a separate CF entry (perhaps better to commit the earlier two patches first, to avoid merge conflicts). I'm optimistic that we can get those 3 patches in the series out of the way early, without blocking on discussing parallel sort. The patch makes tuplesort merging shift down and displace the root tuple with the tape's next preread tuple, rather than compacting and then inserting into the heap anew. This approach to maintaining the heap as tuples are returned to caller will always produce fewer comparisons overall. The new approach is also simpler. We were already shifting down to compact the heap within the misleadingly named [2] function tuplesort_heap_siftup() -- why not instead just use the caller tuple (the tuple that we currently go on to insert) when initially shifting down (not the heap's preexisting last tuple, which is guaranteed to go straight to the leaf level again)? That way, we don't need to enlarge the heap at all through insertion, shifting up, etc. We're done, and are *guaranteed* to have performed less work (fewer comparisons and swaps) than with the existing approach (this is the reason for my optimism about getting this stuff out of the way early). This new approach is more or less the *conventional* way to maintain the heap invariant when returning elements from a heap during k-way merging. Our existing approach is convoluted; merging was presumably only coded that way because the generic functions tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be available. Perhaps the problem was masked by unrelated bottlenecks that existed at the time, too. I think that I could push this further (a minimum of 2 comparisons per item returned when 3 or more tapes are active still seems like 1 comparison too many), but what I have here gets us most of the benefit. And, it does so while not actually adding code that could be called "overly clever", IMV. I'll probably leave clever, aggressive optimization of merging for a later release. [1] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=cahzrixkjp2eo5q0jg1sofqqexfq647ji...@mail.gmail.com [2] https://www.postgresql.org/message-id/cam3swzq+2gjmnv7chxwexqxoplfb_few2rfexhj+gsyf39f...@mail.gmail.com -- Peter Geoghegan

From 71dcd1fc8acf8e54478526bd0a7feaf5e43bf2c1 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan <p...@bowt.ie> Date: Tue, 9 Aug 2016 14:53:49 -0700 Subject: [PATCH 3/6] Displace heap's root during tuplesort merge Directly displace and shift down. This is significantly faster than shifting down to compact the heap, and then shifting up to insert a new preload tuple during tuplesort merging, especially when the merge can naturally return at least a few tuples from each tape (return them as sorted output to caller) before some other tape needs to return tuples. This is fairly common in the real world; tuple attribute values are often found to be physically clustered together on input into tuplesort.c. --- src/backend/utils/sort/tuplesort.c | 110 +++++++++++++++++++++++++++++++++---- 1 file changed, 100 insertions(+), 10 deletions(-) diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index c80e5a1..35d7cb1 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -68,9 +68,9 @@ * (or two, when replacement selection is still used), then merge the runs * using Algorithm D. * - * When merging runs, we use a heap containing just the frontmost tuple from - * each source run; we repeatedly output the smallest tuple and insert the - * next tuple from its source tape (if any). When the heap empties, the merge + * When merging runs, we use a minheap containing only frontmost tuples from + * each source run; on output of the root tuple, it is displaced by the next + * tuple from its source tape (if any). When the heap empties, the merge * is complete. The basic merge algorithm thus needs very little memory --- * only M tuples for an M-way merge, and M is constrained to a small number. * However, we can still make good use of our full workMem allocation by @@ -572,6 +572,7 @@ static void sort_bounded_heap(Tuplesortstate *state); static void tuplesort_sort_memtuples(Tuplesortstate *state); static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple, int tupleindex, bool checkIndex); +static void tuplesort_heap_root_displace(Tuplesortstate *state, SortTuple *newtup); static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex); static void reversedirection(Tuplesortstate *state); static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK); @@ -1993,7 +1994,6 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward, * more generally. */ *stup = state->memtuples[0]; - tuplesort_heap_siftup(state, false); if ((tupIndex = state->mergenext[srcTape]) == 0) { /* @@ -2014,18 +2014,21 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward, */ if ((tupIndex = state->mergenext[srcTape]) == 0) { + /* compact the heap */ + tuplesort_heap_siftup(state, false); /* Free tape's buffer, avoiding dangling pointer */ if (state->batchUsed) mergebatchfreetape(state, srcTape, stup, should_free); return true; } } - /* pull next preread tuple from list, insert in heap */ + /* pull next preread tuple from list, displace root in heap */ newtup = &state->memtuples[tupIndex]; state->mergenext[srcTape] = newtup->tupindex; if (state->mergenext[srcTape] == 0) state->mergelast[srcTape] = 0; - tuplesort_heap_insert(state, newtup, srcTape, false); + newtup->tupindex = srcTape; + tuplesort_heap_root_displace(state, newtup); /* put the now-unused memtuples entry on the freelist */ newtup->tupindex = state->mergefreelist; state->mergefreelist = tupIndex; @@ -2714,8 +2717,6 @@ mergeonerun(Tuplesortstate *state, int destTape, bool finalMerge) */ spaceFreed = state->availMem - priorAvail; state->mergeavailmem[srcTape] += spaceFreed; - /* compact the heap */ - tuplesort_heap_siftup(state, false); if ((tupIndex = state->mergenext[srcTape]) == 0) { /* out of preloaded data on this tape, try to read more */ @@ -2759,18 +2760,21 @@ mergeonerun(Tuplesortstate *state, int destTape, bool finalMerge) /* if still no data, we've reached end of run on this tape */ if ((tupIndex = state->mergenext[srcTape]) == 0) { + /* compact the heap */ + tuplesort_heap_siftup(state, false); /* Free tape's buffer */ if (state->batchUsed) mergebatchfreetape(state, srcTape, NULL, NULL); continue; } } - /* pull next preread tuple from list, insert in heap */ + /* pull next preread tuple from list, displace root in heap */ tup = &state->memtuples[tupIndex]; state->mergenext[srcTape] = tup->tupindex; if (state->mergenext[srcTape] == 0) state->mergelast[srcTape] = 0; - tuplesort_heap_insert(state, tup, srcTape, false); + tup->tupindex = srcTape; + tuplesort_heap_root_displace(state, tup); /* put the now-unused memtuples entry on the freelist */ tup->tupindex = state->mergefreelist; state->mergefreelist = tupIndex; @@ -3909,6 +3913,92 @@ tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple, /* * The tuple at state->memtuples[0] has been removed from the heap. + * Shift down to find new location for next preread tuple from the tape + * that old root originated from. New tuple's tupindex must be + * initialized to its source tape number by caller. + * + * This routine is only used during merging. It is more efficient to + * displace the hole at the root than to compact and then insert into + * the heap. We prefer to not alter the size of the heap until it's + * truly necessary (until a tape is exhausted), and shift down using + * the caller's tuple directly (not the existing heap's last tuple), + * avoiding a separate insertion step entirely. + * + * When the "tape at the root of the heap" does not change, then only + * two comparisons are required, as well as one swap. (Actually, that's + * the maximum cost per subheap/iteration in general.) + */ +static void +tuplesort_heap_root_displace(Tuplesortstate *state, SortTuple *newtup) +{ + SortTuple *memtuples = state->memtuples; + int n, + imin, + i; + + Assert(memtuples[0].tupindex == newtup->tupindex); + + CHECK_FOR_INTERRUPTS(); + + n = state->memtupcount; /* n is heap's size, including old root */ + imin = 0; /* start with caller's "hole" in root */ + i = imin; + /* Shift down as required */ + for (;;) + { + int j = i * 2 + 1; + + if (j >= n) + break; + + /* + * Test newtup as candidate for new root of the (sub)heap whose + * root is i. (The i position is always a hole left either by + * caller, or by a previous iteration here.) + * + * Determine which node among left child, right child, and + * newtup-as-root has the minimum value (and as such should be + * swapped into i's position). + */ + Assert(memtuples[j].tupindex != newtup->tupindex); + if (COMPARETUP(state, newtup, &memtuples[j]) <= 0) + { + /* + * Break when new tuple "fits" as root for (sub)heap (when i + * is equal to imin, and should remain equal) + */ + if (j + 1 < n && + COMPARETUP(state, newtup, &memtuples[j + 1]) > 0) + imin = j + 1; + else + break; + } + else + { + if (j + 1 < n && + COMPARETUP(state, &memtuples[j], &memtuples[j + 1]) > 0) + imin = j + 1; + else + imin = j; + } + + /* + * Fill hole at root of (sub)heap by swapping. Either left + * child or right child of the (sub)heap considered in this + * iteration is swapped in as new root here. Next iteration + * fills the new hole this creates. Last iteration is one that + * finds place for caller's new tuple, leaving no hole. + */ + memtuples[i] = memtuples[imin]; + i = imin; + } + + Assert(state->memtupcount > 1 || imin == 0); + memtuples[imin] = *newtup; +} + +/* + * The tuple at state->memtuples[0] has been removed from the heap. * Decrement memtupcount, and sift up to maintain the heap invariant. */ static void -- 2.7.4

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers