On Wed, Dec 9, 2015 at 12:02 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>
>
> Then in the next (final) merge, it has to read in this huge
> fragmented tape run emulation, generating a lot of random IO to read
> it.

This seems fairly plausible. logtape.c is basically implementing a
small filesystem and doesn't really make any attempt to avoid
fragmentation. The reason it does this is so that blocks can be
reused, avoiding the need for 2x the disk space in temporary files.
Now that we're no longer concerned with keeping the number of tapes
down, I wonder if it makes sense to give up on that goal too and just
write a separate file for each tape, letting the filesystem avoid the
fragmentation. I suspect that would also be better for SSDs and for
filesystems like ZFS, where rewriting blocks can be expensive.


> With the patched code, the average length of reads on files in
> pgsql_tmp between lseeks or changing to a different file descriptor is
> 8, while in the unpatched code it is 14.

I don't think Peter did anything to the scheduling of the merges, so I
don't see how this would be different. It might just be hitting a
preexisting case because the number and size of the tapes changed.

I also don't think the tapes really ought to be so unbalanced. I've
noticed some odd things myself -- like what does a 1-way merge mean
here?

LOG:  finished writing run 56 to tape 2 (9101313 blocks): CPU
0.19s/10.97u sec elapsed 16.68 sec
LOG:  finished writing run 57 to tape 3 (9084929 blocks): CPU
0.19s/11.14u sec elapsed 19.08 sec
LOG:  finished writing run 58 to tape 4 (9101313 blocks): CPU
0.20s/11.31u sec elapsed 19.26 sec
LOG:  performsort starting: CPU 0.20s/11.48u sec elapsed 19.44 sec
LOG:  finished writing run 59 to tape 5 (9109505 blocks): CPU
0.20s/11.49u sec elapsed 19.44 sec
LOG:  finished writing final run 60 to tape 6 (8151041 blocks): CPU
0.20s/11.55u sec elapsed 19.50 sec
LOG:  finished 1-way merge step (1810433 blocks): CPU 0.20s/11.58u sec
elapsed 19.54 sec   <-------------------------=========
LOG:  finished 10-way merge step (19742721 blocks): CPU 0.20s/12.23u
sec elapsed 20.19 sec
LOG:  finished 13-way merge step (23666689 blocks): CPU 0.20s/13.15u
sec elapsed 21.11 sec
LOG:  finished 13-way merge step (47333377 blocks): CPU 0.22s/14.07u
sec elapsed 23.13 sec
LOG:  finished 14-way merge step (47333377 blocks): CPU 0.24s/15.65u
sec elapsed 24.74 sec
LOG:  performsort done (except 14-way final merge): CPU 0.24s/15.66u
sec elapsed 24.75 sec

I wonder if something's wrong with the merge scheduling.

Fwiw, attached are two patches for perusal. One is a trivial patch to
add the size of each tape to the trace_sort output; I guess I'll just
apply that without discussion. The other replaces the insertion sort
for small subarrays with an open-coded sorting network for cases of up
to 8 elements (only in the Perl-generated qsort for the moment). I
don't have the bandwidth to benchmark this right now, but if anyone's
interested in trying, I suspect it'll make a small but noticeable
difference. I'm guessing 2-5%.

-- 
greg
 src/backend/utils/sort/logtape.c   | 9 +++++++++
 src/backend/utils/sort/tuplesort.c | 7 +++++--
 src/include/utils/logtape.h        | 1 +
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 252ba22..20251ab 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -1014,3 +1014,12 @@ LogicalTapeSetBlocks(LogicalTapeSet *lts)
 {
 	return lts->nFileBlocks;
 }
+
+/*
+ * Obtain disk space currently used by a single logical tape, in blocks.
+ */
+long
+LogicalTapeBlocks(LogicalTapeSet *lts, int tapenum)
+{
+	return lts->tapes[tapenum].numFullBlocks + 1;
+}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 6572af8..5d996f2 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2424,7 +2424,9 @@ mergeonerun(Tuplesortstate *state)
 
 #ifdef TRACE_SORT
 	if (trace_sort)
-		elog(LOG, "finished %d-way merge step: %s", state->activeTapes,
+		elog(LOG, "finished %d-way merge step (%ld blocks): %s",
+			 state->activeTapes,
+			 LogicalTapeBlocks(state->tapeset, destTape),
 			 pg_rusage_show(&state->ru_start));
 #endif
 }
@@ -2654,9 +2656,10 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 
 #ifdef TRACE_SORT
 			if (trace_sort)
-				elog(LOG, "finished writing%s run %d to tape %d: %s",
+				elog(LOG, "finished writing%s run %d to tape %d (%ld blocks): %s",
 					 (state->memtupcount == 0) ? " final" : "",
 					 state->currentRun, state->destTape,
+					 LogicalTapeBlocks(state->tapeset, state->destTape),
 					 pg_rusage_show(&state->ru_start));
 #endif
 
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index 47426ce..0b7aefe 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -40,5 +40,6 @@ extern bool LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
 extern void LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 				long *blocknum, int *offset);
 extern long LogicalTapeSetBlocks(LogicalTapeSet *lts);
+extern long LogicalTapeBlocks(LogicalTapeSet *lts, int tapenum);
 
 #endif   /* LOGTAPE_H */
-- 
2.6.2

 src/backend/utils/sort/gen_qsort_tuple.pl | 62 ++++++++++++++++++++++++++++---
 1 file changed, 57 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/sort/gen_qsort_tuple.pl b/src/backend/utils/sort/gen_qsort_tuple.pl
index 6186d0a..7d751b7 100644
--- a/src/backend/utils/sort/gen_qsort_tuple.pl
+++ b/src/backend/utils/sort/gen_qsort_tuple.pl
@@ -162,12 +162,64 @@ qsort_$SUFFIX(SortTuple *a, size_t n$EXTRAARGS)
 
 loop:
 	CHECK_FOR_INTERRUPTS();
-	if (n < 7)
-	{
-		for (pm = a + 1; pm < a + n; pm++)
-			for (pl = pm; pl > a && cmp_$SUFFIX(pl - 1, pl$CMPPARAMS) > 0; pl--)
-				swap(pl, pl - 1);
+	if (n <= 8) {
+		#define swapif(l,r) if (cmp_$SUFFIX(a+l,a+r$CMPPARAMS) > 0) swap(a+l,a+r)
+		switch(n) {
+			case 0: 
+			case 1:
+			break;
+
+			case 2:
+			swapif(0,1);
+			break;
+
+			case 3:
+			swapif(0,1); swapif(1,2);
+			swapif(0,1);
+			break;
+
+			case 4:
+			swapif(0,1); swapif(2,3);
+			swapif(1,3); swapif(0,2);
+			swapif(1,2);
+			break;
+
+			case 5:
+			swapif(1,2); swapif(3,4);
+			swapif(1,3); swapif(0,2);
+			swapif(2,4); swapif(0,3);
+			swapif(0,1); swapif(2,3);
+			swapif(1,2);
+			break;
+
+			case 6:
+			swapif(0,1); swapif(2,3); swapif(4,5);
+			swapif(0,2); swapif(3,5); swapif(1,4);
+			swapif(0,1); swapif(2,3); swapif(4,5);
+			swapif(1,2); swapif(3,4);
+			swapif(2,3);
+			break;
+
+			case 7:
+			swapif(1,2); swapif(3,4); swapif(5,6);
+			swapif(0,2); swapif(4,6); swapif(3,5);
+			swapif(2,6); swapif(1,5); swapif(0,4);
+			swapif(2,5); swapif(0,3);
+			swapif(2,4); swapif(1,3);
+			swapif(0,1); swapif(2,3); swapif(4,5);
+			break;
+
+			case 8:
+			swapif(0, 1); swapif(2, 3); swapif(4, 5); swapif(6, 7);
+			swapif(0, 3); swapif(1, 2); swapif(4, 7); swapif(5, 6);
+			swapif(0, 1); swapif(2, 3); swapif(4, 5); swapif(6, 7);
+			swapif(0, 7); swapif(1, 6); swapif(2, 5); swapif(3, 4);
+			swapif(1, 3); swapif(0, 2); swapif(5, 7); swapif(4, 6);
+			swapif(0, 1); swapif(2, 3); swapif(4, 5); swapif(6, 7);
+			break;
+		}
 		return;
+		#undef swapif
 	}
 	presorted = 1;
 	for (pm = a + 1; pm < a + n; pm++)
-- 
2.6.2

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
