The behavior of external sorts that do not require any merge step due to only having one run (what EXPLAIN ANALYZE output shows as an "external sort", and not a "merge sort") seems like an area that can be significantly improved upon. As noted in code comments, this optimization did not appear in The Art of Computer Programming, Volume III. It's not an unreasonable idea, but it doesn't work well on modern machines due to using heapsort, which is known to use the cache ineffectively. It also often implies significant additional I/O for little or no benefit. I suspect that all the reports we've heard of smaller work_mem sizes improving sort performance are actually down to this one-run optimization *hurting* performance.
The existing optimization for this case tries to benefit from avoiding a final N-way merge step; that's what it's all about. (This is not to be confused with the other reason why a sort can end in TSS_SORTEDONTAPE -- because random tape access is required, and an *on-the-fly* TSS_FINALMERGE merge step cannot happen. Currently, TSS_SORTEDONTAPE is sometimes the "fast" case and sometimes the slow case taken only because the caller specified randomAccess, and I'm improving on only the "fast" case [1], where the caller may or may not have requested randomAccess. I require that they specify !randomAccess to use my improved optimization, though, and I'm not trying to avoid a merge step.)

The existing optimization just dumps all tuples in the memtuples array (which is a heap at that point) to tape, from the top of the heap, writing out one tuple at a time while maintaining the heap invariant throughout. Then, with the memtuples array emptied, tuples are fetched from tape/disk for client code, without any merge step occurring on-the-fly (or at all).

Patch -- "quicksort with spillover"
=========================

With the attached patch, I propose to add an additional, better "one run special case" optimization. This new special case "splits" the single run into two "subruns". One subrun comprises whatever tuples were still in memory when the caller finished passing tuples to tuplesort. To sort those, we use quicksort, which in general has various properties that make it much faster than heapsort -- it's a cache oblivious algorithm, which is important these days.

The other "subrun" is whatever tuples were on tape when tuplesort_performsort() was called. This will often be a minority of the total, but certainly never much more than half. This is already sorted when tuplesort_performsort() is reached. This spillover is already inserted at the front of the sorted-on-tape tuples, and so already has reasonably good cache characteristics.

With the patch, we perform an on-the-fly merge that is somewhat similar to the existing (unaffected) "merge sort" TSS_FINALMERGE case, except that one of the runs is in memory, is potentially much larger than the on-tape/disk run (but never much smaller), and is quicksorted. The existing "merge sort" case similarly is only used when the caller specified !randomAccess.

For subtle reasons, the new TSS_MEMTAPEMERGE case will happen significantly more frequently than the existing, comparable TSS_SORTEDONTAPE case currently happens (this applies to !randomAccess callers only, of course). See the comments added to tuplesort_performsort(), and the commit message, for the details. Note that the existing, comparable case was relocated to tuplesort_performsort(), to highlight that it is now a fallback for the new TSS_MEMTAPEMERGE case (also in tuplesort_performsort()).

Performance
==========

This new approach can be much faster. For example:

select setseed(1);
-- 10 million tuple table with low cardinality integer column to sort:
create unlogged table low_cardinality_int4 as
  select (random() * 1000)::int4 s, 'abcdefghijlmn'::text junk
  from generate_series(1, 10000000);
set work_mem = '225MB';
-- test query:
select count(distinct(s)) from low_cardinality_int4;
 count
-------
  1001
(1 row)

On my laptop, a patched Postgres takes about 4.2 seconds to execute this query; master takes about 16 seconds. With this 225MB work_mem setting, the patch quicksorts 9,830,398 of the 10 million tuples for this case. This is chosen to be a sympathetic case, but it is still quite realistic.
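Before looking at more numbers, here is a deliberately simplified sketch of what the new read phase amounts to. Plain ints stand in for SortTuples, a callback stands in for the logical tape code, and every name here (SpilloverMerge, spillover_getnext(), fetch_tape_fn) is invented for illustration only -- the real logic lives in tuplesort_gettuple_common() in the attached patch.

#include <stdbool.h>
#include <stddef.h>

typedef bool (*fetch_tape_fn) (void *arg, int *out);	/* false on EOF */

typedef struct SpilloverMerge
{
	const int  *mem;			/* quicksorted in-memory subrun */
	size_t		memcount;
	size_t		current;		/* next in-memory element to return */
	fetch_tape_fn fetch_tape;	/* yields the on-tape subrun in sorted order */
	void	   *tape_arg;
	int			tape_cache;		/* pending tape value, if "cached" */
	bool		cached;
	bool		just_memtuples; /* tape exhausted; just drain the array */
} SpilloverMerge;

/*
 * Return the next value in overall sorted order; false when both the
 * on-tape and in-memory subruns are exhausted.
 */
bool
spillover_getnext(SpilloverMerge *m, int *out)
{
	int			tape_val = 0;

	if (!m->just_memtuples)
	{
		if (m->cached)
			tape_val = m->tape_cache;
		else if (m->fetch_tape(m->tape_arg, &tape_val))
		{
			/* cache the tape value; it may lose to several memtuples */
			m->tape_cache = tape_val;
			m->cached = true;
		}
		else
			m->just_memtuples = true;	/* tape subrun exhausted */
	}

	if (!m->just_memtuples &&
		(m->current >= m->memcount || tape_val <= m->mem[m->current]))
	{
		/* tape value wins; ties go to the tape tuple */
		m->cached = false;
		*out = tape_val;
		return true;
	}

	if (m->current < m->memcount)
	{
		*out = m->mem[m->current++];
		return true;
	}
	return false;
}

Returning the tape tuple on ties mirrors the patch's COMPARETUP(...) <= 0 test in the new TSS_MEMTAPEMERGE case.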
We should look at a much less sympathetic case, too. Even when the in-memory "subrun" is about as small as it can be while still having this new special case optimization occur at all, the patch still ends up pretty far ahead of the master branch. With work_mem at 100MB, 4,369,064 tuples are quicksorted by the patch when the above query is executed, which is less than half of the total (10 million). Execution with the patch now takes about 10.2 seconds. Master is about 14.7 seconds with the same work_mem setting (yes, master gets faster as the patch gets slower as work_mem is reduced...but they never come close to meeting). That's still a fairly decent improvement, and it occurs when we're on the verge of not using the optimization at all.

Most users that really care about performance will at least have enough memory to benefit from this optimization when building an index on a large table, because the patch roughly halves the amount of memory you need to get at least some of the benefit of an internal sort.

Performance regressions
----------------------------------

I have been unable to find any regressions in the performance of queries with the patch. If you're looking for a query that might have been regressed, I suggest coming up with a case involving a sort with pass-by-reference types that are expensive. For example, sorting a tuple on many low cardinality text attributes. Only the most extreme such cases seem likely to be regressed, though, because heapsort also has bad temporal locality. My strcoll() result caching patch [2] tries to take advantage of temporal locality rather than spatial locality, which works well with quicksort and mergesort.

Memory use
-----------------

The new TSS_MEMTAPEMERGE case uses no more memory than the existing "TSS_SORTEDONTAPE due to one run" optimization (actually, it uses very slightly less) if we only worry about the high-water mark. In both cases the high-water mark comes as the work_mem limit is reached. Typically, most memory isn't released until the sort is shut down.

Because we're not dumping all tuples here (only as many as we need to dump to stay within our memory budget), we're also not freeing memory for each and every "tuple proper" as each and every SortTuple is written out (because we usually avoid writing out most tuples, which is an important goal of the patch). Although the memtuples array itself is often tuplesort's dominant consumer of work_mem, it's still possible that the aggregate effect of this patch on some workload's memory consumption is that more memory is used. I doubt it, though; the overall effect on memory usage will probably always be to reduce it. Finishing a sort sooner allows us to make memory available for other operations sooner. Besides, we have broken no promise to the caller wrt memory use.

Predictability
==========

The patch alters the performance characteristics of tuplesort, but very much for the better. With the patch, as less memory is gradually made available for sorting, performance tends to degrade gradually as well, because we more or less have an algorithm that's a kind of hybrid internal/external sort that, roughly speaking, blends from the former to the latter based on the availability of memory. It's particularly useful that performance doesn't fall off a cliff when we can no longer fit everything in memory because work_mem is slightly too small. The "almost internal" query from my example above takes about 4.2 seconds.
An equivalent internal sort (quicksort) takes about 3.5 seconds, which is pretty close (~98% of tuples are quicksorted for the "almost internal" case, but heapification is a price we must pay to spill even one tuple). Furthermore, as work_mem is decreased to the point that even the new optimization is no longer used -- when a traditional "merge sort"/TSS_FINALMERGE is used instead -- there is also no big drop.

*Predictable* performance characteristics are a big asset. Decreasing work_mem by 10% (or a moderate increase in the size of a table) should not make the performance of reporting queries tank. Concerns about worst case performance (e.g. particular queries suddenly taking much longer to execute) have certainly prevented me from decreasing work_mem within postgresql.conf in the past, even though I was fairly confident that it made sense for the average case.

Open Items
=========

There are a few smaller open items indicated by "XXX" comments. There is a little overlap between this patch and a couple of others in the queue that also affect sorting. For example, I'm considerate of cases that don't exist yet.

Future work
=========

In the future, we should think about optimizing the "merge sort"/TSS_FINALMERGE case, which should be made to sort *every* run using quicksort (the devil is in the details there, but in general I think that runs should be quicksorted wherever possible). For now, what I came up with seems like a relatively simple approach that offers much of the benefit of that more comprehensive project, since the need to do a "merge sort" is increasingly rare due to the enormous main memory sizes that are affordable these days. Of course, Noah's excellent work on huge repalloc() also contributes to our being able to put large amounts of memory to good use when sorting.

Since heapification is now sometimes a big fraction of the total cost of a sort, even where the heap invariant need not be maintained for any length of time afterwards, it might be worth revisiting the patch to make that an O(n) rather than an O(n log n) operation [3]. Not sure about that, but someone should probably look into it. Jeremy Harris is CC'd here; perhaps he will consider picking that work up once more in light of this development. It would be nice to get a sort that quicksorts 99% of all tuples even closer to a conventional internal sort.

[1] It seems bogus to me that EXPLAIN ANALYZE shows state TSS_SORTEDONTAPE as an "external sort" rather than a "merge sort", regardless of whether or not a merge step was actually required (that is, regardless of whether state ended up TSS_SORTEDONTAPE due to using the existing "single run" optimization, or because the caller required randomAccess).

[2] https://commitfest.postgresql.org/6/294/

[3] http://www.postgresql.org/message-id/52f16843.8080...@wizmail.org

--
Peter Geoghegan
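PS: On the [3] item -- the O(n) approach there is presumably along the lines of the textbook bottom-up ("Floyd") heap construction, which sifts down from the last internal node rather than sift-up-inserting elements one at a time. A toy sketch of that idea follows; a plain int min-heap stands in for the memtuples array, and none of this is tuplesort code.

/* Sift a[i] down until the min-heap invariant holds below it. */
void
sift_down_min(int *a, int n, int i)
{
	for (;;)
	{
		int			smallest = i;
		int			left = 2 * i + 1;
		int			right = left + 1;

		if (left < n && a[left] < a[smallest])
			smallest = left;
		if (right < n && a[right] < a[smallest])
			smallest = right;
		if (smallest == i)
			break;				/* invariant restored */

		{
			int			tmp = a[i];

			a[i] = a[smallest];
			a[smallest] = tmp;
		}
		i = smallest;
	}
}

/*
 * Bottom-up heap construction: nodes at index >= n/2 are leaves and
 * already trivial heaps, so start at the last internal node.  The total
 * sift-down work telescopes to O(n), versus O(n log n) for building the
 * heap by repeated insertion.
 */
void
heapify_bottom_up(int *a, int n)
{
	for (int i = n / 2 - 1; i >= 0; i--)
		sift_down_min(a, n, i);
}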
From 35842f788f43712f33533a65154aafc700ff6d6f Mon Sep 17 00:00:00 2001 From: Peter Geoghegan <peter.geoghega...@gmail.com> Date: Wed, 29 Jul 2015 15:38:12 -0700 Subject: [PATCH 2/2] Add cursory regression tests for sorting This is not intended to be a formal patch submission. Tests are added that happened to be useful during the development of the "quicksort with spillover" patch, as the regression tests currently have precisely zero coverage for any variety of external sort operation. The tests are provided as a convenience to reviewers of that patch only. In the long run, there should be comprehensive smoke-testing of these cases (probably not in the standard regression tests), but this patch does not pretend to be any kind of basis for that. The tests added have a number of obvious problems: * They take far too long to run to be in the standard regression test suite, and yet they're run as part of that suite. With a little effort, they could probably be made to run quicker with no appreciable loss of coverage, but that didn't happen. * They're far from comprehensive. * The tests require about 1.5GB of disk space to run. * They're not portable. They might even be extremely non-portable due to implementation differences across platform pseudo-random number generators. The important point is that each query tested gives consistent results. There is no reason to think that varying work_mem settings will affect which basic approach to sorting each query takes as compared to during my original testing (at least assuming a 64-bit platform), which is also important. --- src/test/regress/expected/sorting.out | 210 ++++++++++++++++++++++++++++++++++ src/test/regress/parallel_schedule | 1 + src/test/regress/serial_schedule | 1 + src/test/regress/sql/sorting.sql | 82 +++++++++++++ 4 files changed, 294 insertions(+) create mode 100644 src/test/regress/expected/sorting.out create mode 100644 src/test/regress/sql/sorting.sql diff --git a/src/test/regress/expected/sorting.out b/src/test/regress/expected/sorting.out new file mode 100644 index 0000000..3aa30f4 --- /dev/null +++ b/src/test/regress/expected/sorting.out @@ -0,0 +1,210 @@ +-- +-- sorting tests +-- +-- +-- int4 test (10 million tuples, medium cardinality) +-- +select setseed(1); + setseed +--------- + +(1 row) + +create unlogged table int4_sort_tbl as + select (random() * 1000000)::int4 s, 'abcdefghijlmn'::text junk + from generate_series(1, 10000000); +-- +-- int4 test (10 million tuples, high cardinality) +-- +create unlogged table highcard_int4_sort_tbl as + select (random() * 100000000)::int4 s, 'abcdefghijlmn'::text junk + from generate_series(1, 10000000); +-- +-- int4 test (10 million tuples, low cardinality) +-- +create unlogged table lowcard_int4_sort_tbl as + select (random() * 100)::int4 s, 'abcdefghijlmn'::text junk + from generate_series(1, 10000000); +-- Results should be consistent: +set work_mem = '64MB'; +select count(distinct(s)) from int4_sort_tbl; + count +-------- + 999949 +(1 row) + +select count(distinct(s)) from highcard_int4_sort_tbl; + count +--------- + 9515397 +(1 row) + +select count(distinct(s)) from lowcard_int4_sort_tbl; + count +------- + 101 +(1 row) + +set work_mem = '100MB'; +select count(distinct(s)) from int4_sort_tbl; + count +-------- + 999949 +(1 row) + +select count(distinct(s)) from highcard_int4_sort_tbl; + count +--------- + 9515397 +(1 row) + +select count(distinct(s)) from lowcard_int4_sort_tbl; + count +------- + 101 +(1 row) + +set work_mem = '110MB'; +select count(distinct(s)) from 
int4_sort_tbl; + count +-------- + 999949 +(1 row) + +select count(distinct(s)) from highcard_int4_sort_tbl; + count +--------- + 9515397 +(1 row) + +select count(distinct(s)) from lowcard_int4_sort_tbl; + count +------- + 101 +(1 row) + +set work_mem = '128MB'; +select count(distinct(s)) from int4_sort_tbl; + count +-------- + 999949 +(1 row) + +select count(distinct(s)) from highcard_int4_sort_tbl; + count +--------- + 9515397 +(1 row) + +select count(distinct(s)) from lowcard_int4_sort_tbl; + count +------- + 101 +(1 row) + +set work_mem = '140MB'; +select count(distinct(s)) from int4_sort_tbl; + count +-------- + 999949 +(1 row) + +select count(distinct(s)) from highcard_int4_sort_tbl; + count +--------- + 9515397 +(1 row) + +select count(distinct(s)) from lowcard_int4_sort_tbl; + count +------- + 101 +(1 row) + +set work_mem = '150MB'; +select count(distinct(s)) from int4_sort_tbl; + count +-------- + 999949 +(1 row) + +select count(distinct(s)) from highcard_int4_sort_tbl; + count +--------- + 9515397 +(1 row) + +select count(distinct(s)) from lowcard_int4_sort_tbl; + count +------- + 101 +(1 row) + +-- should be in-memory: +set work_mem = '512MB'; +select count(distinct(s)) from int4_sort_tbl; + count +-------- + 999949 +(1 row) + +select count(distinct(s)) from highcard_int4_sort_tbl; + count +--------- + 9515397 +(1 row) + +select count(distinct(s)) from lowcard_int4_sort_tbl; + count +------- + 101 +(1 row) + +-- +-- text test (uses abbreviated keys, 10 million tuples) +-- +select setseed(1); + setseed +--------- + +(1 row) + +create unlogged table text_sort_tbl as + select (random() * 100000)::int4::text s + from generate_series(1, 10000000); +-- Start with sort that results in 3-way final merge: +set work_mem = '190MB'; +select count(distinct(s)) from text_sort_tbl; + count +-------- + 100001 +(1 row) + +-- Uses optimization where it's marginal: +set work_mem = '200MB'; +select count(distinct(s)) from text_sort_tbl; + count +-------- + 100001 +(1 row) + +-- Uses optimization where it's favorable: +set work_mem = '450MB'; +select count(distinct(s)) from text_sort_tbl; + count +-------- + 100001 +(1 row) + +-- internal sort: +set work_mem = '500MB'; +select count(distinct(s)) from text_sort_tbl; + count +-------- + 100001 +(1 row) + +drop table int4_sort_tbl; +drop table highcard_int4_sort_tbl; +drop table lowcard_int4_sort_tbl; +drop table text_sort_tbl; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 4df15de..7ff656a 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -37,6 +37,7 @@ test: geometry horology regex oidjoins type_sanity opr_sanity # ---------- test: insert test: insert_conflict +test: sorting test: create_function_1 test: create_type test: create_table diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule index 15d74d4..ebe7de0 100644 --- a/src/test/regress/serial_schedule +++ b/src/test/regress/serial_schedule @@ -51,6 +51,7 @@ test: type_sanity test: opr_sanity test: insert test: insert_conflict +test: sorting test: create_function_1 test: create_type test: create_table diff --git a/src/test/regress/sql/sorting.sql b/src/test/regress/sql/sorting.sql new file mode 100644 index 0000000..75453fc --- /dev/null +++ b/src/test/regress/sql/sorting.sql @@ -0,0 +1,82 @@ +-- +-- sorting tests +-- + +-- +-- int4 test (10 million tuples, medium cardinality) +-- +select setseed(1); +create unlogged table int4_sort_tbl as + select (random() * 1000000)::int4 s, 
'abcdefghijlmn'::text junk + from generate_series(1, 10000000); + +-- +-- int4 test (10 million tuples, high cardinality) +-- +create unlogged table highcard_int4_sort_tbl as + select (random() * 100000000)::int4 s, 'abcdefghijlmn'::text junk + from generate_series(1, 10000000); + +-- +-- int4 test (10 million tuples, low cardinality) +-- +create unlogged table lowcard_int4_sort_tbl as + select (random() * 100)::int4 s, 'abcdefghijlmn'::text junk + from generate_series(1, 10000000); + +-- Results should be consistent: +set work_mem = '64MB'; +select count(distinct(s)) from int4_sort_tbl; +select count(distinct(s)) from highcard_int4_sort_tbl; +select count(distinct(s)) from lowcard_int4_sort_tbl; +set work_mem = '100MB'; +select count(distinct(s)) from int4_sort_tbl; +select count(distinct(s)) from highcard_int4_sort_tbl; +select count(distinct(s)) from lowcard_int4_sort_tbl; +set work_mem = '110MB'; +select count(distinct(s)) from int4_sort_tbl; +select count(distinct(s)) from highcard_int4_sort_tbl; +select count(distinct(s)) from lowcard_int4_sort_tbl; +set work_mem = '128MB'; +select count(distinct(s)) from int4_sort_tbl; +select count(distinct(s)) from highcard_int4_sort_tbl; +select count(distinct(s)) from lowcard_int4_sort_tbl; +set work_mem = '140MB'; +select count(distinct(s)) from int4_sort_tbl; +select count(distinct(s)) from highcard_int4_sort_tbl; +select count(distinct(s)) from lowcard_int4_sort_tbl; +set work_mem = '150MB'; +select count(distinct(s)) from int4_sort_tbl; +select count(distinct(s)) from highcard_int4_sort_tbl; +select count(distinct(s)) from lowcard_int4_sort_tbl; +-- should be in-memory: +set work_mem = '512MB'; +select count(distinct(s)) from int4_sort_tbl; +select count(distinct(s)) from highcard_int4_sort_tbl; +select count(distinct(s)) from lowcard_int4_sort_tbl; + +-- +-- text test (uses abbreviated keys, 10 million tuples) +-- +select setseed(1); +create unlogged table text_sort_tbl as + select (random() * 100000)::int4::text s + from generate_series(1, 10000000); + +-- Start with sort that results in 3-way final merge: +set work_mem = '190MB'; +select count(distinct(s)) from text_sort_tbl; +-- Uses optimization where it's marginal: +set work_mem = '200MB'; +select count(distinct(s)) from text_sort_tbl; +-- Uses optimization where it's favorable: +set work_mem = '450MB'; +select count(distinct(s)) from text_sort_tbl; +-- internal sort: +set work_mem = '500MB'; +select count(distinct(s)) from text_sort_tbl; + +drop table int4_sort_tbl; +drop table highcard_int4_sort_tbl; +drop table lowcard_int4_sort_tbl; +drop table text_sort_tbl; -- 1.9.1
From e0f8a3c4bd6863c90d4169b5cd6a222c7045d6ad Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Wed, 29 Jul 2015 15:38:12 -0700
Subject: [PATCH 1/2] Quicksort when only one initial run is produced

Problem
=======

Previously, in the event of only producing a single run, by the time tuplesort_performsort() was called tuplesort had only incrementally spilled tuples from the memtuples array (which is a heap at that stage) onto tape. The incremental spilling occurs by consuming from the heap, meaning that on-tape tuples are sorted. Following dumping all tuples at once (in a similar fashion to how tuples were incrementally spilled before that point) and observing that there was only ever one run, tuplesort realized its "one run" optimization was applicable (in general, tuplesort imagined that once a sort cannot fit entirely in memory, all tuples must be dumped within tuplesort_performsort()). All tuples end up sorted and on tape (tuple comparisons occur as the heap invariant is maintained while the memtuples heap is spilled).

This approach is useful because no final merge step is required, unlike the "merge sort" case where more than one run is used and multiple tapes must be merged. However, it had a number of significant disadvantages, including:

* The tuples often don't actually need to be written out to tape/disk, but that happens anyway.

* The sort step is a degenerate heapsort. Heapsort is an algorithm that performs very poorly on modern CPUs with fast caches.

* Since we are not quicksorting, we're unable to use the "onlyKey" optimization. Single-attribute heaptuple and datumtuple sorts that don't use abbreviated keys benefit from this optimization considerably.

Solution
========

We can often (but not always) do better. In the event of:

1. Having only a single conventional run (or, more specifically, only one run that made it to disk/tape -- a new, broader criterion for applying a "one run" optimization)

2. The caller not requiring random access (a new, more restrictive criterion for applying a "one run" optimization)

Then:

1. Split input tuples into two conceptual "subruns": one on disk, and the other in memory

2. Quicksort the memtuples array subrun (the other subrun is already sorted when the decision to apply the new optimization is made)

3. Finally, merge the new subruns on-the-fly, much like the existing "external merge" case

Testing shows that this can perform significantly better than the existing "one run" optimization.
--- src/backend/commands/explain.c | 13 +- src/backend/utils/sort/tuplesort.c | 332 +++++++++++++++++++++++++++++++++---- src/include/utils/tuplesort.h | 3 +- 3 files changed, 309 insertions(+), 39 deletions(-) diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 5d06fa4..4f61ee2 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -2178,20 +2178,27 @@ show_sort_info(SortState *sortstate, ExplainState *es) const char *sortMethod; const char *spaceType; long spaceUsed; + int rowsQuicksorted; - tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed); + tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed, + &rowsQuicksorted); if (es->format == EXPLAIN_FORMAT_TEXT) { appendStringInfoSpaces(es->str, es->indent * 2); - appendStringInfo(es->str, "Sort Method: %s %s: %ldkB\n", - sortMethod, spaceType, spaceUsed); + appendStringInfo(es->str, + "Sort Method: %s %s: %ldkB Rows Quicksorted: %d", + sortMethod, + spaceType, + spaceUsed, + rowsQuicksorted); } else { ExplainPropertyText("Sort Method", sortMethod, es); ExplainPropertyLong("Sort Space Used", spaceUsed, es); ExplainPropertyText("Sort Space Type", spaceType, es); + ExplainPropertyInteger("Rows Quicksorted", rowsQuicksorted, es); } } } diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 435041a..5371f4f 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -72,6 +72,15 @@ * to one run per logical tape. The final merge is then performed * on-the-fly as the caller repeatedly calls tuplesort_getXXX; this * saves one cycle of writing all the data out to disk and reading it in. + * Also, if only one run is spilled to tape so far when + * tuplesort_performsort() is reached, and if the caller does not require + * random access, then the merge step can take place between still + * in-memory tuples, and tuples stored on tape (it does not matter that + * there may be a second run in that array -- only that a second one has + * spilled). This ensures that spilling to disk only occurs for a number of + * tuples approximately equal to the number of tuples read in after + * work_mem was reached and it became apparent that an external sort is + * required. * * Before Postgres 8.2, we always used a seven-tape polyphase merge, on the * grounds that 7 is the "sweet spot" on the tapes-to-passes curve according @@ -186,6 +195,7 @@ typedef enum TSS_BUILDRUNS, /* Loading tuples; writing to tape */ TSS_SORTEDINMEM, /* Sort completed entirely in memory */ TSS_SORTEDONTAPE, /* Sort completed, final run is on tape */ + TSS_MEMTAPEMERGE, /* Performing memory/tape merge on-the-fly */ TSS_FINALMERGE /* Performing final merge on-the-fly */ } TupSortStatus; @@ -327,12 +337,22 @@ struct Tuplesortstate int activeTapes; /* # of active input tapes in merge pass */ /* + * These variables are used after tuplesort_performsort() for the + * TSS_MEMTAPEMERGE case. This is a special, optimized final on-the-fly + * merge pass involving merging the result tape with memtuples that were + * quicksorted (but never made it out to a tape). + */ + SortTuple tape_cache; /* cached tape tuple from prior call */ + bool cached; /* tape_cache holds pending tape tuple */ + bool just_memtuples; /* merge only fetching from memtuples */ + + /* * These variables are used after completion of sorting to keep track of * the next tuple to return. (In the tape case, the tape's current read * position is also critical state.) 
*/ int result_tape; /* actual tape number of finished output */ - int current; /* array index (only used if SORTEDINMEM) */ + int current; /* array index (only used if quicksorted) */ bool eof_reached; /* reached EOF (needed for cursors) */ /* markpos_xxx holds marked position for mark and restore */ @@ -460,6 +480,7 @@ struct Tuplesortstate static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess); static void puttuple_common(Tuplesortstate *state, SortTuple *tuple); static bool consider_abort_common(Tuplesortstate *state); +static void tuplesort_quicksort(Tuplesortstate *state); static void inittapes(Tuplesortstate *state); static void selectnewtape(Tuplesortstate *state); static void mergeruns(Tuplesortstate *state); @@ -1549,6 +1570,23 @@ consider_abort_common(Tuplesortstate *state) return false; } +static void +tuplesort_quicksort(Tuplesortstate *state) +{ + if (state->memtupcount > 1) + { + /* Can we use the single-key sort function? */ + if (state->onlyKey != NULL) + qsort_ssup(state->memtuples, state->memtupcount, + state->onlyKey); + else + qsort_tuple(state->memtuples, + state->memtupcount, + state->comparetup, + state); + } +} + /* * All tuples have been provided; finish the sort. */ @@ -1569,20 +1607,9 @@ tuplesort_performsort(Tuplesortstate *state) /* * We were able to accumulate all the tuples within the allowed - * amount of memory. Just qsort 'em and we're done. + * amount of memory. Just quicksort 'em and we're done. */ - if (state->memtupcount > 1) - { - /* Can we use the single-key sort function? */ - if (state->onlyKey != NULL) - qsort_ssup(state->memtuples, state->memtupcount, - state->onlyKey); - else - qsort_tuple(state->memtuples, - state->memtupcount, - state->comparetup, - state); - } + tuplesort_quicksort(state); state->current = 0; state->eof_reached = false; state->markpos_offset = 0; @@ -1609,12 +1636,111 @@ tuplesort_performsort(Tuplesortstate *state) /* * Finish tape-based sort. First, flush all tuples remaining in - * memory out to tape; then merge until we have a single remaining - * run (or, if !randomAccess, one run per tape). Note that - * mergeruns sets the correct state->status. + * memory out to tape where that's required (it's required when + * more than one run is represented on tape, or when the caller + * required random access). + * + * We do this before considering which of the three cases are + * taken, because dumping tuples can increase currentRun, + * altering the outcome. */ - dumptuples(state, true); - mergeruns(state); + if (state->currentRun > 0 || state->randomAccess) + dumptuples(state, true); + + if (state->currentRun > 1) + { + /* + * Merge until we have a single remaining run (or, if + * !randomAccess, one run per tape). Note that mergeruns sets + * the correct state->status. + */ + mergeruns(state); + } + else if (state->currentRun == 0) + { + /* + * Avoid dumping most tuples. + * + * Since we produced only one initial run and the caller + * does not require random access, we can use our result + * tape as part of the finished output. We quicksort the + * remaining unsorted (although heapified) tuples in the + * memtuples array, and merge that with the sorted-on-tape + * contents of this run. + * + * This is similar to TSS_FINALMERGE, which also has an + * on-the-fly merge step; the major difference is that we + * merge memtuples with a tape (each is therefore a "subrun"). + * Merging is fairly fast when we merge from memory to a + * single tape, and this avoids significant I/O. 
+ * + * Note that there might actually be 2 runs, but only the + * contents of one of them went to tape, and so we can + * safely "pretend" that there is only 1 run (since we're + * about to give up on the idea of the memtuples array being + * a heap). This means that if our sort happened to require + * random access, the similar "single run" optimization + * below (which sets TSS_SORTEDONTAPE) might not be used at + * all. This is because dumping all tuples out might have + * forced an otherwise equivalent randomAccess case to + * acknowledge a second run, which we can avoid. + * + * And so, this optimization is more effective than the + * similar 1 run TSS_SORTEDONTAPE optimization below because + * it will be used more frequently. A further advantage is + * that quicksort is very fast. Modern CPUs tend to be + * enormously bottlenecked on memory access, which heap sort + * exacerbates, whereas quicksort is a cache-oblivious + * algorithm, and so uses the cache optimally. The fact + * that the memtuples array has already been heapified is no + * reason to commit to the path of unnecessarily dumping and + * heap-sorting input tuples. + * + * Effectively, our "single" run is "split" into an on-disk + * subrun and a memtuples subrun that stores a significant + * proportion of all tuples to be sorted. Often, this + * subrun is much larger than the tape subrun, which is + * where the optimization is most effective. + */ + Assert(!state->randomAccess); + markrunend(state, state->currentRun); + + /* + * Note that the quicksort may use abbreviated keys, which + * are no longer available for the tuples that spilled to + * tape. This is something that the final on-the-fly merge + * step accounts for. + */ + tuplesort_quicksort(state); + state->current = 0; + +#ifdef TRACE_SORT + if (trace_sort) + elog(LOG, "finished quicksort of %d tuples, a large portion of final run: %s", + state->memtupcount, + pg_rusage_show(&state->ru_start)); +#endif + + state->result_tape = state->tp_tapenum[state->destTape]; + /* Must freeze and rewind the finished output tape */ + LogicalTapeFreeze(state->tapeset, state->result_tape); + state->status = TSS_MEMTAPEMERGE; + } + else + { + /* + * Since we produced only one initial run (quite likely if + * the total data volume is between 1X and 2X workMem), we + * can just use that tape as the finished output, rather + * than doing a useless merge. (This obvious optimization + * is not in Knuth's algorithm.) 
+ */ + Assert(state->randomAccess); + state->result_tape = state->tp_tapenum[state->destTape]; + /* Must freeze and rewind the finished output tape */ + LogicalTapeFreeze(state->tapeset, state->result_tape); + state->status = TSS_SORTEDONTAPE; + } state->eof_reached = false; state->markpos_block = 0L; state->markpos_offset = 0; @@ -1633,6 +1759,9 @@ tuplesort_performsort(Tuplesortstate *state) elog(LOG, "performsort done (except %d-way final merge): %s", state->activeTapes, pg_rusage_show(&state->ru_start)); + else if (state->status == TSS_MEMTAPEMERGE) + elog(LOG, "performsort done (except memory/tape final merge): %s", + pg_rusage_show(&state->ru_start)); else elog(LOG, "performsort done: %s", pg_rusage_show(&state->ru_start)); @@ -1784,6 +1913,138 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward, READTUP(state, stup, state->result_tape, tuplen); return true; + case TSS_MEMTAPEMERGE: + Assert(forward); + /* For now, assume tuple should not be freed by caller */ + *should_free = false; + + if (state->eof_reached) + return false; + + if (state->just_memtuples) + goto just_memtuples; + + /* + * Should be at least one memtuple, since no "all" dump of + * memtuples occurred (avoiding that is a big reason for the + * existence of the TSS_MEMTAPEMERGE case) + */ + Assert(state->memtupcount > 0); + + /* + * Merge together quicksorted memtuples array, and sorted tape. + * + * When the decision was taken to apply this optimization, the + * array was heapified. Some number of tuples were spilled to + * disk from the top of the heap, and are read from tape here in + * fully sorted order. That does not imply that there are no + * tuples in the memtuples array that are less than any tuple + * stored on tape; that's why this merge step is necessary. + * + * "stup" is always the current tape tuple, which may be cached + * from previous call, or read from tape in this call (and + * cached for next). + * + * Exhaust the supply of tape tuples first. + */ + if (state->cached) + { + *stup = state->tape_cache; + } + else if ((tuplen = getlen(state, state->result_tape, true)) != 0) + { + READTUP(state, stup, state->result_tape, tuplen); + state->tape_cache = *stup; + } + else + { + /* + * Initially, we mostly only returned tape tuples, with a + * few interspersed memtuples. From here on, we just fetch + * from memtuples array. + * + * We may end up here rather quickly, particularly when + * there are proportionally fewer tape tuples, which is + * common. This is because the tape tuples overall sorted + * order skews very low (they're not guaranteed to be + * strictly lower than any memtuples tuple, but since they + * were written to tape from the top of a then-heapified + * memtuples array, many will be). + */ + state->just_memtuples = true; +just_memtuples: + *should_free = false; + + /* + * XXX: If this routine ever supports memory prefetching, + * it ought to happen here too. Often, the majority of + * tuples will be returned to caller from here, in a manner + * very similar to the TSS_SORTEDINMEM case. + */ + if (state->current < state->memtupcount) + { + *stup = state->memtuples[state->current++]; + return true; + } + state->eof_reached = true; + return false; + } + + /* + * Kludge: Trigger abbreviated tie-breaker if in-memory tuples + * use abbreviation (writing tuples to tape never preserves + * abbreviated keys). Do this by assigning in-memory + * abbreviated tuple to tape tuple directly. 
+ * + * It doesn't seem worth generating a new abbreviated key for + * the tape tuple, and this approach is simpler than + * "unabbreviating" the memtuple tuple from a "common" routine + * like this. + * + * In the future, this routine could offer an API that allows + * certain clients (like ordered set aggregate callers) to + * cheaply test *inequality* across adjacent pairs of sorted + * tuples on the basis of simple abbreviated key binary + * inequality. Another advantage of this approach is that that + * can still work without reporting to clients that abbreviation + * wasn't used. The tape tuples might only be a small minority + * of all tuples returned. + */ + if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL) + stup->datum1 = state->memtuples[state->current].datum1; + + /* + * Compare a tape tuple to a memtuple. + * + * Since we always start with at least one memtuple, and since tape + * tuples are always returned before equal memtuples, it follows + * that there must be at least one memtuple left to return here. + */ + Assert(state->current < state->memtupcount); + + if (COMPARETUP(state, stup, &state->memtuples[state->current]) <= 0) + { + /* + * Tape tuple less than or equal to memtuple array current + * iterator position. Return it. + */ + state->cached = false; + *should_free = true; + } + else + { + /* + * Tape tuple greater than memtuple array's current tuple. + * + * Return current memtuple tuple, and cache tape tuple for + * next call, where it may be returned. + */ + state->cached = true; + *should_free = false; + *stup = state->memtuples[state->current++]; + } + return true; + case TSS_FINALMERGE: Assert(forward); *should_free = true; @@ -2120,6 +2381,8 @@ inittapes(Tuplesortstate *state) state->tp_runs = (int *) palloc0(maxTapes * sizeof(int)); state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int)); state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int)); + state->cached = false; + state->just_memtuples = false; /* * Convert the unsorted contents of memtuples[] into a heap. Each tuple is @@ -2213,21 +2476,6 @@ mergeruns(Tuplesortstate *state) Assert(state->status == TSS_BUILDRUNS); Assert(state->memtupcount == 0); - /* - * If we produced only one initial run (quite likely if the total data - * volume is between 1X and 2X workMem), we can just use that tape as the - * finished output, rather than doing a useless merge. (This obvious - * optimization is not in Knuth's algorithm.) - */ - if (state->currentRun == 1) - { - state->result_tape = state->tp_tapenum[state->destTape]; - /* must freeze and rewind the finished output tape */ - LogicalTapeFreeze(state->tapeset, state->result_tape); - state->status = TSS_SORTEDONTAPE; - return; - } - if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL) { /* @@ -2770,7 +3018,8 @@ void tuplesort_get_stats(Tuplesortstate *state, const char **sortMethod, const char **spaceType, - long *spaceUsed) + long *spaceUsed, + int *rowsQuicksorted) { /* * Note: it might seem we should provide both memory and disk usage for a @@ -2780,6 +3029,10 @@ tuplesort_get_stats(Tuplesortstate *state, * rely on availMem in a disk sort. This does not seem worth the overhead * to fix. Is it worth creating an API for the memory context code to * tell us how much is actually used in sortcontext? + * + * XXX: Revisit this restriction. TSS_MEMTAPEMERGE may require both + * memory and disk space usage (although since memory used should always + * be work_mem, perhaps not). 
*/ if (state->tapeset) { @@ -2792,17 +3045,26 @@ tuplesort_get_stats(Tuplesortstate *state, *spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024; } + *rowsQuicksorted = 0; + switch (state->status) { case TSS_SORTEDINMEM: if (state->boundUsed) *sortMethod = "top-N heapsort"; else + { *sortMethod = "quicksort"; + *rowsQuicksorted = state->memtupcount; + } break; case TSS_SORTEDONTAPE: *sortMethod = "external sort"; break; + case TSS_MEMTAPEMERGE: + *sortMethod = "quicksort with spillover"; + *rowsQuicksorted = state->memtupcount; + break; case TSS_FINALMERGE: *sortMethod = "external merge"; break; diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h index de6fc56..e5f5501 100644 --- a/src/include/utils/tuplesort.h +++ b/src/include/utils/tuplesort.h @@ -109,7 +109,8 @@ extern void tuplesort_end(Tuplesortstate *state); extern void tuplesort_get_stats(Tuplesortstate *state, const char **sortMethod, const char **spaceType, - long *spaceUsed); + long *spaceUsed, + int *rowsQuicksorted); extern int tuplesort_merge_order(int64 allowedMem); -- 1.9.1