While reviewing Peter's latest round of sorting patches, and trying to understand the new "batch allocation" mechanism, I started to wonder how useful the pre-reading in the merge stage is in the first place.

I'm talking about the code that reads a bunch of tuples from each tape, loading them into the memtuples array. That code was added by Tom Lane, back in 1999:

commit cf627ab41ab9f6038a29ddd04dd0ff0ccdca714e
Author: Tom Lane <t...@sss.pgh.pa.us>
Date:   Sat Oct 30 17:27:15 1999 +0000

Further performance improvements in sorting: reduce number of comparisons
    during initial run formation by keeping both current run and next-run
    tuples in the same heap (yup, Knuth is smarter than I am).  And, during
    merge passes, make use of available sort memory to load multiple tuples
    from any one input 'tape' at a time, thereby improving locality of
    access to the temp file.

So apparently there was a benefit back then, but is it still worthwhile? LogicalTape already buffers one block at a time, so how much are we really gaining by parsing the tuples into SortTuple format in batches?

I wrote a quick patch to test that, attached. It seems to improve performance, at least in this small test case:

create table lotsofints(i integer);
insert into lotsofints select random() * 1000000000.0 from generate_series(1, 10000000);
vacuum freeze;

select count(*) FROM (select * from lotsofints order by i) t;

On my laptop, with default work_mem=4MB, that select takes 7.8 s on unpatched master, and 6.2 s with the attached patch.

So, at least in some cases, the pre-loading hurts, and I think we should get rid of it. This patch probably needs some polishing: I replaced the batch allocations with a simpler scheme that keeps a buffer holding just a single tuple for each tape, and that might need some more work to allow downsizing those buffers if you have a few very large tuples in an otherwise narrow table. Perhaps we should also free the memtuples array and reallocate a smaller one for the merging, now that we're not making use of the whole of it. And perhaps we should teach LogicalTape to use larger buffers, if we can't rely on the OS to do the buffering for us. But overall, this seems to make the code both simpler and faster.

Am I missing something?

- Heikki

From ea4ce25a33d0dec370a1b5e45cbc6f794e377a90 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 6 Sep 2016 14:38:54 +0300
Subject: [PATCH 1/1] Don't bother to pre-read tuples into slots during merge.

That only seems to add overhead. We're doing the same number of READTUP()
calls either way, but pre-reading spreads the memory usage over a larger
area, so it doesn't seem worth it. However, we're no longer using all the
available memory this way. Are we now doing overly short reads from the
underlying files? Perhaps we should increase the buffer size in
LogicalTape instead, if that would help?
---
 src/backend/utils/sort/tuplesort.c | 487 ++++++-------------------------------
 1 file changed, 80 insertions(+), 407 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index c8fbcf8..1fc1b5e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -358,42 +358,27 @@ struct Tuplesortstate
 	 */
 
 	/*
-	 * These variables are only used during merge passes.  mergeactive[i] is
+	 * This variable is only used during merge passes.  mergeactive[i] is
 	 * true if we are reading an input run from (actual) tape number i and
-	 * have not yet exhausted that run.  mergenext[i] is the memtuples index
-	 * of the next pre-read tuple (next to be loaded into the heap) for tape
-	 * i, or 0 if we are out of pre-read tuples.  mergelast[i] similarly
-	 * points to the last pre-read tuple from each tape.  mergeavailslots[i]
-	 * is the number of unused memtuples[] slots reserved for tape i, and
-	 * mergeavailmem[i] is the amount of unused space allocated for tape i.
-	 * mergefreelist and mergefirstfree keep track of unused locations in the
-	 * memtuples[] array.  The memtuples[].tupindex fields link together
-	 * pre-read tuples for each tape as well as recycled locations in
-	 * mergefreelist. It is OK to use 0 as a null link in these lists, because
-	 * memtuples[0] is part of the merge heap and is never a pre-read tuple.
+	 * have not yet exhausted that run.
 	 */
 	bool	   *mergeactive;	/* active input run source? */
-	int		   *mergenext;		/* first preread tuple for each source */
-	int		   *mergelast;		/* last preread tuple for each source */
-	int		   *mergeavailslots;	/* slots left for prereading each tape */
-	int64	   *mergeavailmem;	/* availMem for prereading each tape */
-	int			mergefreelist;	/* head of freelist of recycled slots */
-	int			mergefirstfree; /* first slot never used in this merge */
 
 	/*
-	 * Per-tape batch state, when final on-the-fly merge consumes memory from
-	 * just a few large allocations.
+	 * Per-tape batch state, when final on-the-fly merge uses pre-allocated
+	 * buffers to hold just the latest tuple, instead of using palloc() for
+	 * each tuple. We have one buffer to hold the next tuple from each tape,
+	 * plus one buffer to hold the tuple we last returned to the caller.
 	 *
 	 * Aside from the general benefits of performing fewer individual retail
 	 * palloc() calls, this also helps make merging more cache efficient,
-	 * since each tape's tuples must naturally be accessed sequentially (in
-	 * sorted order).
+	 * since we reuse the same memory quickly.
 	 */
-	int64		spacePerTape;	/* Space (memory) for tuples (not slots) */
-	char	  **mergetuples;	/* Each tape's memory allocation */
-	char	  **mergecurrent;	/* Current offset into each tape's memory */
-	char	  **mergetail;		/* Last item's start point for each tape */
-	char	  **mergeoverflow;	/* Retail palloc() "overflow" for each tape */
+	char	  **mergetuples;		/* Each tape's memory allocation */
+	int		   *mergetuplesizes;	/* size of each allocation */
+
+	char	   *mergelasttuple;
+	int			mergelasttuplesize;	/* allocated size */
 
 	/*
 	 * Variables for Algorithm D.  Note that destTape is a "logical" tape
@@ -555,14 +540,8 @@ static void mergeruns(Tuplesortstate *state);
 static void mergeonerun(Tuplesortstate *state);
 static void beginmerge(Tuplesortstate *state, bool finalMergeBatch);
 static void batchmemtuples(Tuplesortstate *state);
-static void mergebatch(Tuplesortstate *state, int64 spacePerTape);
-static void mergebatchone(Tuplesortstate *state, int srcTape,
-			  SortTuple *stup, bool *should_free);
-static void mergebatchfreetape(Tuplesortstate *state, int srcTape,
-				   SortTuple *rtup, bool *should_free);
-static void *mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen);
-static void mergepreread(Tuplesortstate *state);
-static void mergeprereadone(Tuplesortstate *state, int srcTape);
+static void mergebatch(Tuplesortstate *state);
+static bool mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
 static void dumpbatch(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
@@ -1976,8 +1955,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 			if (state->memtupcount > 0)
 			{
 				int			srcTape = state->memtuples[0].tupindex;
-				int			tupIndex;
-				SortTuple  *newtup;
+				SortTuple	newtup;
 
 				/*
 				 * Returned tuple is still counted in our memory space most of
@@ -1988,42 +1966,15 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 				 */
 				*stup = state->memtuples[0];
 				tuplesort_heap_siftup(state, false);
-				if ((tupIndex = state->mergenext[srcTape]) == 0)
-				{
-					/*
-					 * out of preloaded data on this tape, try to read more
-					 *
-					 * Unlike mergeonerun(), we only preload from the single
-					 * tape that's run dry, though not before preparing its
-					 * batch memory for a new round of sequential consumption.
-					 * See mergepreread() comments.
-					 */
-					if (state->batchUsed)
-						mergebatchone(state, srcTape, stup, should_free);
-
-					mergeprereadone(state, srcTape);
 
-					/*
-					 * if still no data, we've reached end of run on this tape
-					 */
-					if ((tupIndex = state->mergenext[srcTape]) == 0)
-					{
-						/* Free tape's buffer, avoiding dangling pointer */
-						if (state->batchUsed)
-							mergebatchfreetape(state, srcTape, stup, should_free);
-						return true;
-					}
+				/* pull next tuple from tape, insert in heap */
+				if (!mergereadnext(state, srcTape, &newtup))
+				{
+					/* we've reached end of run on this tape */
+					return true;
 				}
-				/* pull next preread tuple from list, insert in heap */
-				newtup = &state->memtuples[tupIndex];
-				state->mergenext[srcTape] = newtup->tupindex;
-				if (state->mergenext[srcTape] == 0)
-					state->mergelast[srcTape] = 0;
-				tuplesort_heap_insert(state, newtup, srcTape, false);
-				/* put the now-unused memtuples entry on the freelist */
-				newtup->tupindex = state->mergefreelist;
-				state->mergefreelist = tupIndex;
-				state->mergeavailslots[srcTape]++;
+
+				tuplesort_heap_insert(state, &newtup, srcTape, false);
 				return true;
 			}
 			return false;
@@ -2350,14 +2301,8 @@ inittapes(Tuplesortstate *state)
 	state->tapeset = LogicalTapeSetCreate(maxTapes);
 
 	state->mergeactive = (bool *) palloc0(maxTapes * sizeof(bool));
-	state->mergenext = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailmem = (int64 *) palloc0(maxTapes * sizeof(int64));
 	state->mergetuples = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergecurrent = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergetail = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergeoverflow = (char **) palloc0(maxTapes * sizeof(char *));
+	state->mergetuplesizes = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
@@ -2617,10 +2562,6 @@ mergeonerun(Tuplesortstate *state)
 {
 	int			destTape = state->tp_tapenum[state->tapeRange];
 	int			srcTape;
-	int			tupIndex;
-	SortTuple  *tup;
-	int64		priorAvail,
-				spaceFreed;
 
 	/*
 	 * Start the merge by loading one tuple from each active source tape into
@@ -2635,33 +2576,21 @@ mergeonerun(Tuplesortstate *state)
 	 */
 	while (state->memtupcount > 0)
 	{
+		SortTuple stup;
+
 		/* write the tuple to destTape */
-		priorAvail = state->availMem;
 		srcTape = state->memtuples[0].tupindex;
 		WRITETUP(state, destTape, &state->memtuples[0]);
-		/* writetup adjusted total free space, now fix per-tape space */
-		spaceFreed = state->availMem - priorAvail;
-		state->mergeavailmem[srcTape] += spaceFreed;
 		/* compact the heap */
 		tuplesort_heap_siftup(state, false);
-		if ((tupIndex = state->mergenext[srcTape]) == 0)
+
+		/* pull next tuple from tape, insert in heap */
+		if (!mergereadnext(state, srcTape, &stup))
 		{
-			/* out of preloaded data on this tape, try to read more */
-			mergepreread(state);
-			/* if still no data, we've reached end of run on this tape */
-			if ((tupIndex = state->mergenext[srcTape]) == 0)
-				continue;
+			/* we've reached end of run on this tape */
+			continue;
 		}
-		/* pull next preread tuple from list, insert in heap */
-		tup = &state->memtuples[tupIndex];
-		state->mergenext[srcTape] = tup->tupindex;
-		if (state->mergenext[srcTape] == 0)
-			state->mergelast[srcTape] = 0;
-		tuplesort_heap_insert(state, tup, srcTape, false);
-		/* put the now-unused memtuples entry on the freelist */
-		tup->tupindex = state->mergefreelist;
-		state->mergefreelist = tupIndex;
-		state->mergeavailslots[srcTape]++;
+		tuplesort_heap_insert(state, &stup, srcTape, false);
 	}
 
 	/*
@@ -2704,8 +2633,6 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 	int			activeTapes;
 	int			tapenum;
 	int			srcTape;
-	int			slotsPerTape;
-	int64		spacePerTape;
 
 	/* Heap should be empty here */
 	Assert(state->memtupcount == 0);
@@ -2729,14 +2656,6 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 	}
 	state->activeTapes = activeTapes;
 
-	/* Clear merge-pass state variables */
-	memset(state->mergenext, 0,
-		   state->maxTapes * sizeof(*state->mergenext));
-	memset(state->mergelast, 0,
-		   state->maxTapes * sizeof(*state->mergelast));
-	state->mergefreelist = 0;	/* nothing in the freelist */
-	state->mergefirstfree = activeTapes;		/* 1st slot avail for preread */
-
 	if (finalMergeBatch)
 	{
 		/* Free outright buffers for tape never actually allocated */
@@ -2749,22 +2668,7 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 		batchmemtuples(state);
 	}
 
-	/*
-	 * Initialize space allocation to let each active input tape have an equal
-	 * share of preread space.
-	 */
 	Assert(activeTapes > 0);
-	slotsPerTape = (state->memtupsize - state->mergefirstfree) / activeTapes;
-	Assert(slotsPerTape > 0);
-	spacePerTape = MAXALIGN_DOWN(state->availMem / activeTapes);
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		if (state->mergeactive[srcTape])
-		{
-			state->mergeavailslots[srcTape] = slotsPerTape;
-			state->mergeavailmem[srcTape] = spacePerTape;
-		}
-	}
 
 	/*
 	 * Preallocate tuple batch memory for each tape.  This is the memory used
@@ -2773,35 +2677,21 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 	 * once per sort, just in advance of the final on-the-fly merge step.
 	 */
 	if (finalMergeBatch)
-		mergebatch(state, spacePerTape);
-
-	/*
-	 * Preread as many tuples as possible (and at least one) from each active
-	 * tape
-	 */
-	mergepreread(state);
+		mergebatch(state);
 
 	/* Load the merge heap with the first tuple from each input tape */
 	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
 	{
-		int			tupIndex = state->mergenext[srcTape];
-		SortTuple  *tup;
+		SortTuple	tup;
 
-		if (tupIndex)
+		if (mergereadnext(state, srcTape, &tup))
 		{
-			tup = &state->memtuples[tupIndex];
-			state->mergenext[srcTape] = tup->tupindex;
-			if (state->mergenext[srcTape] == 0)
-				state->mergelast[srcTape] = 0;
-			tuplesort_heap_insert(state, tup, srcTape, false);
-			/* put the now-unused memtuples entry on the freelist */
-			tup->tupindex = state->mergefreelist;
-			state->mergefreelist = tupIndex;
-			state->mergeavailslots[srcTape]++;
+			tuplesort_heap_insert(state, &tup, srcTape, false);
 
 #ifdef TRACE_SORT
 			if (trace_sort && finalMergeBatch)
 			{
+#if 0
 				int64		perTapeKB = (spacePerTape + 1023) / 1024;
 				int64		usedSpaceKB;
 				int			usedSlots;
@@ -2828,6 +2718,7 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 					 (double) usedSpaceKB / (double) perTapeKB,
 					 usedSlots, slotsPerTape,
 					 (double) usedSlots / (double) slotsPerTape);
+#endif
 			}
 #endif
 		}
@@ -2923,7 +2814,7 @@ batchmemtuples(Tuplesortstate *state)
  * goal.
  */
 static void
-mergebatch(Tuplesortstate *state, int64 spacePerTape)
+mergebatch(Tuplesortstate *state)
 {
 	int			srcTape;
 
@@ -2943,283 +2834,46 @@ mergebatch(Tuplesortstate *state, int64 spacePerTape)
 			continue;
 
 		/* Allocate buffer for each active tape */
-		mergetuples = MemoryContextAllocHuge(state->tuplecontext,
-											 spacePerTape);
+		mergetuples = MemoryContextAlloc(state->tuplecontext, BLCKSZ);
 
 		/* Initialize state for tape */
 		state->mergetuples[srcTape] = mergetuples;
-		state->mergecurrent[srcTape] = mergetuples;
-		state->mergetail[srcTape] = mergetuples;
-		state->mergeoverflow[srcTape] = NULL;
+		state->mergetuplesizes[srcTape] = BLCKSZ;
 	}
 
-	state->batchUsed = true;
-	state->spacePerTape = spacePerTape;
-}
+	/* and one more buffer that's not associated with any tape initially */
+	state->mergelasttuple = MemoryContextAlloc(state->tuplecontext, BLCKSZ);
+	state->mergelasttuplesize = BLCKSZ;
 
-/*
- * mergebatchone - prepare batch memory for one merge input tape
- *
- * This is called following the exhaustion of preread tuples for one input
- * tape.  All that actually occurs is that the state for the source tape is
- * reset to indicate that all memory may be reused.
- *
- * This routine must deal with fixing up the tuple that is about to be returned
- * to the client, due to "overflow" allocations.
- */
-static void
-mergebatchone(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-			  bool *should_free)
-{
-	Assert(state->batchUsed);
-
-	/*
-	 * Tuple about to be returned to caller ("stup") is final preread tuple
-	 * from tape, just removed from the top of the heap.  Special steps around
-	 * memory management must be performed for that tuple, to make sure it
-	 * isn't overwritten early.
-	 */
-	if (!state->mergeoverflow[srcTape])
-	{
-		Size		tupLen;
-
-		/*
-		 * Mark tuple buffer range for reuse, but be careful to move final,
-		 * tail tuple to start of space for next run so that it's available to
-		 * caller when stup is returned, and remains available at least until
-		 * the next tuple is requested.
-		 */
-		tupLen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		MOVETUP(state->mergecurrent[srcTape], state->mergetail[srcTape],
-				tupLen);
-
-		/* Make SortTuple at top of the merge heap point to new tuple */
-		rtup->tuple = (void *) state->mergecurrent[srcTape];
-
-		state->mergetail[srcTape] = state->mergecurrent[srcTape];
-		state->mergecurrent[srcTape] += tupLen;
-	}
-	else
-	{
-		/*
-		 * Handle an "overflow" retail palloc.
-		 *
-		 * This is needed when we run out of tuple memory for the tape.
-		 */
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		state->mergetail[srcTape] = state->mergetuples[srcTape];
-
-		if (rtup->tuple)
-		{
-			Assert(rtup->tuple == (void *) state->mergeoverflow[srcTape]);
-			/* Caller should free palloc'd tuple */
-			*should_free = true;
-		}
-		state->mergeoverflow[srcTape] = NULL;
-	}
-}
-
-/*
- * mergebatchfreetape - handle final clean-up for batch memory once tape is
- * about to become exhausted
- *
- * All tuples are returned from tape, but a single final tuple, *rtup, is to be
- * passed back to caller.  Free tape's batch allocation buffer while ensuring
- * that the final tuple is managed appropriately.
- */
-static void
-mergebatchfreetape(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-				   bool *should_free)
-{
-	Assert(state->batchUsed);
-	Assert(state->status == TSS_FINALMERGE);
-
-	/*
-	 * Tuple may or may not already be an overflow allocation from
-	 * mergebatchone()
-	 */
-	if (!*should_free && rtup->tuple)
-	{
-		/*
-		 * Final tuple still in tape's batch allocation.
-		 *
-		 * Return palloc()'d copy to caller, and have it freed in a similar
-		 * manner to overflow allocation.  Otherwise, we'd free batch memory
-		 * and pass back a pointer to garbage.  Note that we deliberately
-		 * allocate this in the parent tuplesort context, to be on the safe
-		 * side.
-		 */
-		Size		tuplen;
-		void	   *oldTuple = rtup->tuple;
-
-		tuplen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		rtup->tuple = MemoryContextAlloc(state->sortcontext, tuplen);
-		MOVETUP(rtup->tuple, oldTuple, tuplen);
-		*should_free = true;
-	}
-
-	/* Free spacePerTape-sized buffer */
-	pfree(state->mergetuples[srcTape]);
-}
-
-/*
- * mergebatchalloc - allocate memory for one tuple using a batch memory
- * "logical allocation".
- *
- * This is used for the final on-the-fly merge phase only.  READTUP() routines
- * receive memory from here in place of palloc() and USEMEM() calls.
- *
- * Tuple tapenum is passed, ensuring each tape's tuples are stored in sorted,
- * contiguous order (while allowing safe reuse of memory made available to
- * each tape).  This maximizes locality of access as tuples are returned by
- * final merge.
- *
- * Caller must not subsequently attempt to free memory returned here.  In
- * general, only mergebatch* functions know about how memory returned from
- * here should be freed, and this function's caller must ensure that batch
- * memory management code will definitely have the opportunity to do the right
- * thing during the final on-the-fly merge.
- */
-static void *
-mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen)
-{
-	Size		reserve_tuplen = MAXALIGN(tuplen);
-	char	   *ret;
-
-	/* Should overflow at most once before mergebatchone() call: */
-	Assert(state->mergeoverflow[tapenum] == NULL);
-	Assert(state->batchUsed);
-
-	/* It should be possible to use precisely spacePerTape memory at once */
-	if (state->mergecurrent[tapenum] + reserve_tuplen <=
-		state->mergetuples[tapenum] + state->spacePerTape)
-	{
-		/*
-		 * Usual case -- caller is returned pointer into its tape's buffer,
-		 * and an offset from that point is recorded as where tape has
-		 * consumed up to for current round of preloading.
-		 */
-		ret = state->mergetail[tapenum] = state->mergecurrent[tapenum];
-		state->mergecurrent[tapenum] += reserve_tuplen;
-	}
-	else
-	{
-		/*
-		 * Allocate memory, and record as tape's overflow allocation.  This
-		 * will be detected quickly, in a similar fashion to a LACKMEM()
-		 * condition, and should not happen again before a new round of
-		 * preloading for caller's tape.  Note that we deliberately allocate
-		 * this in the parent tuplesort context, to be on the safe side.
-		 *
-		 * Sometimes, this does not happen because merging runs out of slots
-		 * before running out of memory.
-		 */
-		ret = state->mergeoverflow[tapenum] =
-			MemoryContextAlloc(state->sortcontext, tuplen);
-	}
-
-	return ret;
+	state->batchUsed = true;
 }
 
 /*
- * mergepreread - load tuples from merge input tapes
+ * mergereadnext - load tuple from one merge input tape
  *
- * This routine exists to improve sequentiality of reads during a merge pass,
- * as explained in the header comments of this file.  Load tuples from each
- * active source tape until the tape's run is exhausted or it has used up
- * its fair share of available memory.  In any case, we guarantee that there
- * is at least one preread tuple available from each unexhausted input tape.
- *
- * We invoke this routine at the start of a merge pass for initial load,
- * and then whenever any tape's preread data runs out.  Note that we load
- * as much data as possible from all tapes, not just the one that ran out.
- * This is because logtape.c works best with a usage pattern that alternates
- * between reading a lot of data and writing a lot of data, so whenever we
- * are forced to read, we should fill working memory completely.
- *
- * In FINALMERGE state, we *don't* use this routine, but instead just preread
- * from the single tape that ran dry.  There's no read/write alternation in
- * that state and so no point in scanning through all the tapes to fix one.
- * (Moreover, there may be quite a lot of inactive tapes in that state, since
- * we might have had many fewer runs than tapes.  In a regular tape-to-tape
- * merge we can expect most of the tapes to be active.  Plus, only
- * FINALMERGE state has to consider memory management for a batch
- * allocation.)
- */
-static void
-mergepreread(Tuplesortstate *state)
-{
-	int			srcTape;
-
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-		mergeprereadone(state, srcTape);
-}
-
-/*
- * mergeprereadone - load tuples from one merge input tape
+ * Returns false on EOF.
  *
  * Read tuples from the specified tape until it has used up its free memory
  * or array slots; but ensure that we have at least one tuple, if any are
  * to be had.
  */
-static void
-mergeprereadone(Tuplesortstate *state, int srcTape)
+static bool
+mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup)
 {
 	unsigned int tuplen;
-	SortTuple	stup;
-	int			tupIndex;
-	int64		priorAvail,
-				spaceUsed;
 
 	if (!state->mergeactive[srcTape])
-		return;					/* tape's run is already exhausted */
-
-	/*
-	 * Manage per-tape availMem.  Only actually matters when batch memory not
-	 * in use.
-	 */
-	priorAvail = state->availMem;
-	state->availMem = state->mergeavailmem[srcTape];
+		return false;					/* tape's run is already exhausted */
 
-	/*
-	 * When batch memory is used if final on-the-fly merge, only mergeoverflow
-	 * test is relevant; otherwise, only LACKMEM() test is relevant.
-	 */
-	while ((state->mergeavailslots[srcTape] > 0 &&
-			state->mergeoverflow[srcTape] == NULL && !LACKMEM(state)) ||
-		   state->mergenext[srcTape] == 0)
+	/* read next tuple, if any */
+	if ((tuplen = getlen(state, srcTape, true)) == 0)
 	{
-		/* read next tuple, if any */
-		if ((tuplen = getlen(state, srcTape, true)) == 0)
-		{
-			state->mergeactive[srcTape] = false;
-			break;
-		}
-		READTUP(state, &stup, srcTape, tuplen);
-		/* find a free slot in memtuples[] for it */
-		tupIndex = state->mergefreelist;
-		if (tupIndex)
-			state->mergefreelist = state->memtuples[tupIndex].tupindex;
-		else
-		{
-			tupIndex = state->mergefirstfree++;
-			Assert(tupIndex < state->memtupsize);
-		}
-		state->mergeavailslots[srcTape]--;
-		/* store tuple, append to list for its tape */
-		stup.tupindex = 0;
-		state->memtuples[tupIndex] = stup;
-		if (state->mergelast[srcTape])
-			state->memtuples[state->mergelast[srcTape]].tupindex = tupIndex;
-		else
-			state->mergenext[srcTape] = tupIndex;
-		state->mergelast[srcTape] = tupIndex;
+		state->mergeactive[srcTape] = false;
+		return false;
 	}
-	/* update per-tape and global availmem counts */
-	spaceUsed = state->mergeavailmem[srcTape] - state->availMem;
-	state->mergeavailmem[srcTape] = state->availMem;
-	state->availMem = priorAvail - spaceUsed;
+	READTUP(state, stup, srcTape, tuplen);
+
+	return true;
 }
 
 /*
@@ -3861,14 +3515,33 @@ readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen)
 {
 	if (state->batchUsed)
 	{
+		char	   *buf;
+		int			bufsize;
+
 		/*
+		 * Recycle the buffer that held the previous tuple returned from
+		 * the sort. Enlarge it if it's not large enough to hold the new
+		 * tuple.
+		 *
 		 * No USEMEM() call, because during final on-the-fly merge accounting
-		 * is based on tape-private state. ("Overflow" allocations are
-		 * detected as an indication that a new round or preloading is
-		 * required. Preloading marks existing contents of tape's batch buffer
-		 * for reuse.)
+		 * is based on tape-private state.
 		 */
-		return mergebatchalloc(state, tapenum, tuplen);
+		if (tuplen > state->mergelasttuplesize)
+		{
+			state->mergelasttuple = repalloc(state->mergelasttuple, tuplen);
+			state->mergelasttuplesize = tuplen;
+		}
+		buf = state->mergelasttuple;
+		bufsize = state->mergelasttuplesize;
+
+		/* we will return the previous tuple from this tape next. */
+		state->mergelasttuple = state->mergetuples[tapenum];
+		state->mergelasttuplesize = state->mergetuplesizes[tapenum];
+
+		state->mergetuples[tapenum] = buf;
+		state->mergetuplesizes[tapenum] = bufsize;
+
+		return buf;
 	}
 	else
 	{
-- 
2.9.3

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)