On Thu, Sep 15, 2016 at 12:50 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> On 09/14/2016 07:57 PM, Tom Lane wrote:
>>
>> Pavan Deolasee <pavan.deola...@gmail.com> writes:
>>>
>>> On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera
>>> <alvhe...@2ndquadrant.com>
>>> wrote:
>>>>
>>>> One thing not quite clear to me is how do we create the bitmap
>>>> representation starting from the array representation in midflight
>>>> without using twice as much memory transiently.  Are we going to write
>>>> the array to a temp file, free the array memory, then fill the bitmap by
>>>> reading the array from disk?
>>
>>
>>> We could do that.
>>
>>
>> People who are vacuuming because they are out of disk space will be very
>> very unhappy with that solution.
>
>
> People are usually running out of space for data, while these files
> would be temporary files placed wherever temp_tablespaces points. I'd
> argue that if this is a source of problems, those people are already
> in deep trouble due to sorts, CREATE INDEX, etc., as those commands
> may also generate a lot of temporary files.

One would not expect "CREATE INDEX" to succeed when space is tight,
but VACUUM is quite the opposite: it is often run precisely because
space is running out.

Still, temporary storage could be used when it is available, with a
graceful fallback to some other technique (like not using bitmaps)
when it is not.

Not sure it's worth the trouble, though.
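
To make the spill idea concrete, here is a minimal standalone sketch
of that midflight conversion, in plain C for illustration only (none
of this is in the attached patch): tmpfile() stands in for the
BufFile/temp_tablespaces machinery, and TID / MAX_OFFSET stand in for
ItemPointerData / MaxHeapTuplesPerPage. The array is spilled to a
temp file, freed, and rebuilt as a bitmap while streaming the file
back, so peak memory is max(array, bitmap) rather than their sum:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define MAX_OFFSET 291          /* stand-in for MaxHeapTuplesPerPage */

typedef struct { uint32_t block; uint16_t offset; } TID;

static unsigned char *
tids_to_bitmap(TID *tids, size_t ntids, uint32_t nblocks)
{
    FILE       *f = tmpfile();
    size_t      nbits = (size_t) nblocks * MAX_OFFSET;
    unsigned char *bitmap;
    TID         t;

    /*
     * 1. Spill the array; if there's no temp space, the caller can
     * gracefully fall back to keeping the plain array.
     */
    if (f == NULL || fwrite(tids, sizeof(TID), ntids, f) != ntids)
        return NULL;
    free(tids);                 /* 2. Release the array's memory. */

    /*
     * 3. Rebuild as a bitmap: one bit per (block, offset) slot;
     * offsets are 1-based, hence the -1.
     */
    bitmap = calloc((nbits + 7) / 8, 1);
    rewind(f);
    while (fread(&t, sizeof(TID), 1, f) == 1)
    {
        size_t bit = (size_t) t.block * MAX_OFFSET + (t.offset - 1);
        bitmap[bit / 8] |= 1 << (bit % 8);
    }
    fclose(f);
    return bitmap;
}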


On Wed, Sep 14, 2016 at 12:24 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>
>> I am kind of doubtful about this whole line of investigation because
>> we're basically trying pretty hard to fix something that I'm not sure
>> is broken. I do agree that, all other things being equal, the TID
>> lookups will probably be faster with a bitmap than with a binary
>> search, but maybe not if the table is large and the number of dead
>> TIDs is small, because cache efficiency is pretty important.  But even
>> if it's always faster, does TID lookup speed even really matter to
>> overall VACUUM performance? Claudio's early results suggest that it
>> might, but maybe that's just a question of some optimization that
>> hasn't been done yet.
>
> FYI, the reported impact was on CPU time, not runtime. There was no
> significant difference in runtime (real time), because my test is
> heavily I/O bound.
>
> I tested with a few small tables and there was no significant
> difference either, but small tables don't stress the array lookup
> anyway so that's expected.
>
> But on the assumption that some systems may be CPU bound during vacuum
> (particularly those able to do more than 300-400MB/s sequential I/O),
> in those cases any increase or decrease in the cost of lazy_tid_reaped
> will directly correlate with runtime. That just isn't true of my
> systems, which all run on Amazon and are heavily bandwidth constrained
> (the fastest I/O subsystem I can get my hands on does 200MB/s).

Attached is the patch with the multiarray version.
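
For scale, a back-of-envelope on why this matters (my numbers, not
from the patch): each dead TID is a 6-byte ItemPointerData, so the
current MaxAllocSize-capped array tops out at roughly 1GB / 6B ~= 178M
TIDs. The 400M dead tuples in the test below therefore force multiple
index-scan passes today, while a multiarray bounded only by
maintenance_work_mem (400M * 6B ~= 2.3GB) can handle them in one.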

The test results are weird. Comparing the best case over several runs,
to reduce the impact of concurrent activity on this host (I couldn't
remove all background activity even when running the tests overnight;
the distro adds tons of cron jobs and background cleanup tasks, it
would seem), there's only a very mild CPU impact. I'd say
insignificant, as it's well below the variance between runs.

Worst case:

DETAIL:  CPU 9.90s/80.94u sec elapsed 1232.42 sec.

Best case:

DETAIL:  CPU 12.10s/63.82u sec elapsed 832.79 sec.

There seems to be more variance with the multiarray approach than the
single array one, but I could not figure out why. Even I/O seems less
stable:

Worst case:

INFO:  "pgbench_accounts": removed 400000000 row versions in 6557382 pages
DETAIL:  CPU 64.31s/37.60u sec elapsed 2573.88 sec.

Best case:

INFO:  "pgbench_accounts": removed 400000000 row versions in 6557378 pages
DETAIL:  CPU 54.48s/31.78u sec elapsed 1552.18 sec.

Since this test takes several hours to complete, I could only do a
few runs of each version, so the statistical significance of the
results is weak.

I'll try to compare with smaller pgbench scale factors and more runs
over the weekend (I still have to script that). It's certainly
puzzling; I cannot explain the increased variance, especially in I/O,
since the I/O should be exactly the same. I'm betting it's my system
that's somehow unpredictable. So I'm posting the patch in case someone
gets inspired and can spot the reason, and because there's been a lot
of talk about this very same approach, so I thought I'd better post
the code ;)

I'll also try to get a more predictable system to run the tests on.
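
For those who'd rather not parse the diff to get the idea, the lookup
side of the multiarray boils down to the following condensed
restatement of the patch's lazy_tid_reaped() (it reuses the patch's
own types and comparator, and omits the lower-bound quick check the
real code also does):

static bool
multiarray_tid_reaped(DeadTuplesMultiArray *m, ItemPointer itemptr)
{
    int         nseg;

    /*
     * Linear scan over the few segments, using each segment's cached
     * last_dead_tuple as an upper bound.
     */
    for (nseg = 0; nseg <= m->last_seg; nseg++)
    {
        DeadTuplesSegment *seg = &m->dead_tuples[nseg];
        int         cmp = vac_cmp_itemptr(itemptr, &seg->last_dead_tuple);

        if (cmp == 0)
            return true;    /* matched the segment's upper bound itself */
        if (cmp < 0)        /* first segment whose range can contain it */
            return bsearch(itemptr, seg->dead_tuples,
                           seg->num_dead_tuples, sizeof(ItemPointerData),
                           vac_cmp_itemptr) != NULL;
    }
    return false;           /* past the last segment's upper bound */
}
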
From f85fd4213eb6c0b4da8dc9196424f6e8a5a2a9a7 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to allow using more than 1GB of
work mem during vacuum and thus reduce the number of full
index scans needed to remove all dead TIDs when that much
memory is available.
---
 src/backend/commands/vacuumlazy.c | 253 +++++++++++++++++++++++++++++---------
 1 file changed, 193 insertions(+), 60 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 854bce3..2df17b1 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -91,6 +91,7 @@
  * tables.
  */
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
+#define LAZY_MIN_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
 
 /*
  * Before we consider skipping a page that's marked as clean in
@@ -98,6 +99,22 @@
  */
 #define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment {
+	int		num_dead_tuples;	/* # of entries in the segment */
+	int		max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset until the segment is fully populated) */
+	unsigned short  padding;
+	ItemPointer	dead_tuples;		/* Array of dead tuples */
+} DeadTuplesSegment;
+
+typedef struct DeadTuplesMultiArray {
+	int		num_entries;	/* current # of entries */
+	int		max_entries;	/* total # of slots that can be allocated in array */
+	int		num_segs;	/* number of dead tuple segments allocated */
+	int		last_seg;	/* last dead tuple segment with data (or 0) */
+	DeadTuplesSegment	*dead_tuples;	/* array of num_segs segments */
+} DeadTuplesMultiArray;
+
 typedef struct LVRelStats
 {
 	/* hasindex = true means two-pass strategy; false means one-pass */
@@ -117,14 +134,14 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	/* List of TIDs of tuples we intend to delete */
 	/* NB: this list is ordered by TID address */
-	int			num_dead_tuples;	/* current # of entries */
-	int			max_dead_tuples;	/* # slots allocated in array */
-	ItemPointer dead_tuples;	/* array of ItemPointerData */
+	DeadTuplesMultiArray	dead_tuples;
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
 } LVRelStats;
 
+#define DeadTuplesCurrentSegment(lvrelstats) (&((lvrelstats)->dead_tuples.dead_tuples[ \
+	(lvrelstats)->dead_tuples.last_seg ]))
 
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
@@ -149,7 +166,7 @@ static void lazy_cleanup_index(Relation indrel,
 				   IndexBulkDeleteResult *stats,
 				   LVRelStats *vacrelstats);
 static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-				 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
+				 int tupindex, LVRelStats *vacrelstats, DeadTuplesSegment *seg, Buffer *vmbuffer);
 static bool should_attempt_truncation(LVRelStats *vacrelstats);
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
@@ -157,8 +174,9 @@ static BlockNumber count_nondeletable_pages(Relation onerel,
 static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
 static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
 					   ItemPointer itemptr);
+static void lazy_clear_dead_tuples(LVRelStats *vacrelstats);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
-static int	vac_cmp_itemptr(const void *left, const void *right);
+static inline int	vac_cmp_itemptr(const void *left, const void *right);
 static bool heap_page_is_all_visible(Relation rel, Buffer buf,
 					 TransactionId *visibility_cutoff_xid, bool *all_frozen);
 
@@ -498,7 +516,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	/* Report that we're scanning the heap, advertising total # of blocks */
 	initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
 	initprog_val[1] = nblocks;
-	initprog_val[2] = vacrelstats->max_dead_tuples;
+	initprog_val[2] = vacrelstats->dead_tuples.max_entries;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
 	/*
@@ -677,8 +695,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 		 * If we are close to overrunning the available space for dead-tuple
 		 * TIDs, pause and do a cycle of vacuuming before we tackle this page.
 		 */
-		if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage &&
-			vacrelstats->num_dead_tuples > 0)
+		if ((vacrelstats->dead_tuples.max_entries - vacrelstats->dead_tuples.num_entries) < MaxHeapTuplesPerPage &&
+			vacrelstats->dead_tuples.num_entries > 0)
 		{
 			const int	hvp_index[] = {
 				PROGRESS_VACUUM_PHASE,
@@ -729,7 +747,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 			 * not to reset latestRemovedXid since we want that value to be
 			 * valid.
 			 */
-			vacrelstats->num_dead_tuples = 0;
+			lazy_clear_dead_tuples(vacrelstats);
 			vacrelstats->num_index_scans++;
 
 			/* Report that we are once again scanning the heap */
@@ -911,7 +929,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 		has_dead_tuples = false;
 		nfrozen = 0;
 		hastup = false;
-		prev_dead_count = vacrelstats->num_dead_tuples;
+		prev_dead_count = vacrelstats->dead_tuples.num_entries;
 		maxoff = PageGetMaxOffsetNumber(page);
 
 		/*
@@ -1123,10 +1141,11 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 		 * instead of doing a second scan.
 		 */
 		if (nindexes == 0 &&
-			vacrelstats->num_dead_tuples > 0)
+			vacrelstats->dead_tuples.num_entries > 0)
 		{
 			/* Remove tuples from heap */
-			lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+			Assert(vacrelstats->dead_tuples.last_seg == 0); /* Should not need more than one segment per page */
+			lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, DeadTuplesCurrentSegment(vacrelstats), &vmbuffer);
 			has_dead_tuples = false;
 
 			/*
@@ -1134,7 +1153,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 			 * not to reset latestRemovedXid since we want that value to be
 			 * valid.
 			 */
-			vacrelstats->num_dead_tuples = 0;
+			lazy_clear_dead_tuples(vacrelstats);
 			vacuumed_pages++;
 		}
 
@@ -1237,7 +1256,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 		 * page, so remember its free space as-is.  (This path will always be
 		 * taken if there are no indexes.)
 		 */
-		if (vacrelstats->num_dead_tuples == prev_dead_count)
+		if (vacrelstats->dead_tuples.num_entries == prev_dead_count)
 			RecordPageWithFreeSpace(onerel, blkno, freespace);
 	}
 
@@ -1268,7 +1287,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 
 	/* If any tuples need to be deleted, perform final vacuum cycle */
 	/* XXX put a threshold on min number of tuples here? */
-	if (vacrelstats->num_dead_tuples > 0)
+	if (vacrelstats->dead_tuples.num_entries > 0)
 	{
 		const int	hvp_index[] = {
 			PROGRESS_VACUUM_PHASE,
@@ -1362,7 +1381,9 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 static void
 lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 {
+	int			tottuples;
 	int			tupindex;
+	int			segindex;
 	int			npages;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
@@ -1370,35 +1391,41 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 	pg_rusage_init(&ru0);
 	npages = 0;
 
-	tupindex = 0;
-	while (tupindex < vacrelstats->num_dead_tuples)
-	{
-		BlockNumber tblk;
-		Buffer		buf;
-		Page		page;
-		Size		freespace;
+	tupindex = 0;
+	tottuples = 0;
+	for (segindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; tupindex = 0, segindex++) {
+		DeadTuplesSegment *seg = &(vacrelstats->dead_tuples.dead_tuples[segindex]);
+		int num_dead_tuples = seg->num_dead_tuples;
+		while (tupindex < num_dead_tuples)
+		{
+			BlockNumber tblk;
+			Buffer		buf;
+			Page		page;
+			Size		freespace;
 
-		vacuum_delay_point();
+			vacuum_delay_point();
 
-		tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]);
-		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
-								 vac_strategy);
-		if (!ConditionalLockBufferForCleanup(buf))
-		{
-			ReleaseBuffer(buf);
-			++tupindex;
-			continue;
-		}
-		tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
-									&vmbuffer);
+			tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
+			buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
+									 vac_strategy);
+			if (!ConditionalLockBufferForCleanup(buf))
+			{
+				ReleaseBuffer(buf);
+				++tupindex;
+				continue;
+			}
+			tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
+										seg, &vmbuffer);
 
-		/* Now that we've compacted the page, record its available space */
-		page = BufferGetPage(buf);
-		freespace = PageGetHeapFreeSpace(page);
+			/* Now that we've compacted the page, record its available space */
+			page = BufferGetPage(buf);
+			freespace = PageGetHeapFreeSpace(page);
 
-		UnlockReleaseBuffer(buf);
-		RecordPageWithFreeSpace(onerel, tblk, freespace);
-		npages++;
+			UnlockReleaseBuffer(buf);
+			RecordPageWithFreeSpace(onerel, tblk, freespace);
+			npages++;
+		}
+		tottuples += tupindex;
 	}
 
 	if (BufferIsValid(vmbuffer))
@@ -1410,7 +1437,7 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 	ereport(elevel,
 			(errmsg("\"%s\": removed %d row versions in %d pages",
 					RelationGetRelationName(onerel),
-					tupindex, npages),
+					tottuples, npages),
 			 errdetail("%s.",
 					   pg_rusage_show(&ru0))));
 }
@@ -1421,34 +1448,36 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
  *
  * Caller must hold pin and buffer cleanup lock on the buffer.
  *
- * tupindex is the index in vacrelstats->dead_tuples of the first dead
+ * tupindex is the index in seg->dead_tuples of the first dead
  * tuple for this page.  We assume the rest follow sequentially.
  * The return value is the first tupindex after the tuples of this page.
  */
 static int
 lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-				 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer)
+				 int tupindex, LVRelStats *vacrelstats, DeadTuplesSegment *seg, Buffer *vmbuffer)
 {
 	Page		page = BufferGetPage(buffer);
 	OffsetNumber unused[MaxOffsetNumber];
 	int			uncnt = 0;
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
+	ItemPointer	dead_tuples = seg->dead_tuples;
+	int		num_dead_tuples = seg->num_dead_tuples;
 
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	START_CRIT_SECTION();
 
-	for (; tupindex < vacrelstats->num_dead_tuples; tupindex++)
+	for (; tupindex < num_dead_tuples; tupindex++)
 	{
 		BlockNumber tblk;
 		OffsetNumber toff;
 		ItemId		itemid;
 
-		tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]);
+		tblk = ItemPointerGetBlockNumber(&dead_tuples[tupindex]);
 		if (tblk != blkno)
 			break;				/* past end of tuples for this block */
-		toff = ItemPointerGetOffsetNumber(&vacrelstats->dead_tuples[tupindex]);
+		toff = ItemPointerGetOffsetNumber(&dead_tuples[tupindex]);
 		itemid = PageGetItemId(page, toff);
 		ItemIdSetUnused(itemid);
 		unused[uncnt++] = toff;
@@ -1582,6 +1611,7 @@ lazy_vacuum_index(Relation indrel,
 {
 	IndexVacuumInfo ivinfo;
 	PGRUsage	ru0;
+	DeadTuplesSegment	*seg;
 
 	pg_rusage_init(&ru0);
 
@@ -1592,6 +1622,11 @@ lazy_vacuum_index(Relation indrel,
 	ivinfo.num_heap_tuples = vacrelstats->old_rel_tuples;
 	ivinfo.strategy = vac_strategy;
 
+	/* Finalize the current segment by setting its upper bound dead tuple */
+	seg = DeadTuplesCurrentSegment(vacrelstats);
+	if (seg->num_dead_tuples > 0)
+		seg->last_dead_tuple = seg->dead_tuples[seg->num_dead_tuples-1];
+
 	/* Do bulk deletion */
 	*stats = index_bulk_delete(&ivinfo, *stats,
 							   lazy_tid_reaped, (void *) vacrelstats);
@@ -1599,7 +1634,7 @@ lazy_vacuum_index(Relation indrel,
 	ereport(elevel,
 			(errmsg("scanned index \"%s\" to remove %d row versions",
 					RelationGetRelationName(indrel),
-					vacrelstats->num_dead_tuples),
+					vacrelstats->dead_tuples.num_entries),
 			 errdetail("%s.", pg_rusage_show(&ru0))));
 }
 
@@ -1962,7 +1997,7 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 static void
 lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 {
-	long		maxtuples;
+	long		maxtuples, mintuples;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
 	autovacuum_work_mem != -1 ?
 	autovacuum_work_mem : maintenance_work_mem;
@@ -1971,7 +2006,6 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 	{
 		maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
 		maxtuples = Min(maxtuples, INT_MAX);
-		maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
 
 		/* curious coding here to ensure the multiplication can't overflow */
 		if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
@@ -1985,10 +2019,18 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 		maxtuples = MaxHeapTuplesPerPage;
 	}
 
-	vacrelstats->num_dead_tuples = 0;
-	vacrelstats->max_dead_tuples = (int) maxtuples;
-	vacrelstats->dead_tuples = (ItemPointer)
-		palloc(maxtuples * sizeof(ItemPointerData));
+	mintuples = Min(LAZY_MIN_TUPLES, maxtuples);
+
+	vacrelstats->dead_tuples.num_entries = 0;
+	vacrelstats->dead_tuples.max_entries = (int) maxtuples;
+	vacrelstats->dead_tuples.num_segs = 1;
+	vacrelstats->dead_tuples.last_seg = 0;
+	vacrelstats->dead_tuples.dead_tuples = (DeadTuplesSegment*)
+		palloc(sizeof(DeadTuplesSegment));
+	vacrelstats->dead_tuples.dead_tuples[0].dead_tuples = (ItemPointer)
+		palloc(mintuples * sizeof(ItemPointerData));
+	vacrelstats->dead_tuples.dead_tuples[0].max_dead_tuples = mintuples;
+	vacrelstats->dead_tuples.dead_tuples[0].num_dead_tuples = 0;
 }
 
 /*
@@ -2003,16 +2045,74 @@ lazy_record_dead_tuple(LVRelStats *vacrelstats,
 	 * could if we are given a really small maintenance_work_mem. In that
 	 * case, just forget the last few tuples (we'll get 'em next time).
 	 */
-	if (vacrelstats->num_dead_tuples < vacrelstats->max_dead_tuples)
+	if (vacrelstats->dead_tuples.num_entries < vacrelstats->dead_tuples.max_entries)
 	{
-		vacrelstats->dead_tuples[vacrelstats->num_dead_tuples] = *itemptr;
-		vacrelstats->num_dead_tuples++;
+		DeadTuplesSegment *seg = DeadTuplesCurrentSegment(vacrelstats);
+		if (seg->num_dead_tuples >= seg->max_dead_tuples) {
+			/*
+			 * The segment is overflowing, so we must allocate a new segment.
+			 * We could have a preallocated segment descriptor already, in
+			 * which case we just reinitialize it, or we may need to repalloc
+			 * the vacrelstats->dead_tuples array. In that case, seg will
+			 * no longer be valid, so we must be careful about that.
+			 * In any case, we must update the last_dead_tuple copy in the
+			 * overflowing segment descriptor.
+			 */
+			Assert(seg->num_dead_tuples == seg->max_dead_tuples);
+			seg->last_dead_tuple = seg->dead_tuples[seg->num_dead_tuples-1];
+			if (vacrelstats->dead_tuples.last_seg+1 >= vacrelstats->dead_tuples.num_segs) {
+				int new_num_segs = vacrelstats->dead_tuples.num_segs * 2;
+				vacrelstats->dead_tuples.dead_tuples = (DeadTuplesSegment*) repalloc(
+					(void*) vacrelstats->dead_tuples.dead_tuples,
+					new_num_segs * sizeof(DeadTuplesSegment));
+				while (vacrelstats->dead_tuples.num_segs < new_num_segs) {
+					/* Initialize as "unallocated" */
+					DeadTuplesSegment *nseg = &(vacrelstats->dead_tuples.dead_tuples[
+						vacrelstats->dead_tuples.num_segs]);
+					nseg->num_dead_tuples = 0;
+					nseg->max_dead_tuples = 0;
+					nseg->dead_tuples = NULL;
+					vacrelstats->dead_tuples.num_segs++;
+				}
+				seg = DeadTuplesCurrentSegment(vacrelstats);
+			}
+			vacrelstats->dead_tuples.last_seg++;
+			seg = DeadTuplesCurrentSegment(vacrelstats);
+			if (seg->dead_tuples == NULL) {
+				/* If unallocated, allocate up to twice the amount of the previous segment */
+				DeadTuplesSegment *pseg = seg - 1;
+				int numtuples = vacrelstats->dead_tuples.max_entries - vacrelstats->dead_tuples.num_entries;
+				numtuples = Max(numtuples, MaxHeapTuplesPerPage);
+				numtuples = Min(numtuples, INT_MAX/2);
+				numtuples = Min(numtuples, 2 * pseg->max_dead_tuples);
+				numtuples = Min(numtuples, MaxAllocSize / sizeof(ItemPointerData));
+				seg->dead_tuples = (ItemPointer) palloc(sizeof(ItemPointerData) * numtuples);
+				seg->max_dead_tuples = numtuples;
+			}
+			seg->num_dead_tuples = 0;
+		}
+		seg->dead_tuples[seg->num_dead_tuples] = *itemptr;
+		vacrelstats->dead_tuples.num_entries++;
+		seg->num_dead_tuples++;
 		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
-									 vacrelstats->num_dead_tuples);
+									 vacrelstats->dead_tuples.num_entries);
 	}
 }
 
 /*
+ * lazy_clear_dead_tuples - reset all dead tuples segments
+ */
+static void
+lazy_clear_dead_tuples(LVRelStats *vacrelstats)
+{
+	int nseg;
+	for (nseg = 0; nseg <= vacrelstats->dead_tuples.last_seg; nseg++)
+		vacrelstats->dead_tuples.dead_tuples[nseg].num_dead_tuples = 0;
+	vacrelstats->dead_tuples.last_seg = 0;
+	vacrelstats->dead_tuples.num_entries = 0;
+}
+
+/*
  *	lazy_tid_reaped() -- is a particular tid deletable?
  *
  *		This has the right signature to be an IndexBulkDeleteCallback.
@@ -2024,10 +2124,43 @@ lazy_tid_reaped(ItemPointer itemptr, void *state)
 {
 	LVRelStats *vacrelstats = (LVRelStats *) state;
 	ItemPointer res;
+	DeadTuplesSegment *seg, *rseg;
+	int nseg, last_seg;
+
+	if (vacrelstats->dead_tuples.num_segs == 0)
+		return false;
+
+	/*
+	 * Do a sequential search to find the first segment that could contain the item pointer.
+	 * Given the number of segments involved, binary search would be pointless.
+	 */
+	seg = vacrelstats->dead_tuples.dead_tuples;
+	rseg = NULL;
+	last_seg = vacrelstats->dead_tuples.last_seg;
+	for (nseg = 0; nseg <= last_seg; nseg++, seg++) {
+		int rcomp = vac_cmp_itemptr((void*)itemptr, (void*)&(seg->last_dead_tuple));
+		if (rcomp == 0) {
+			/* Found it, what are the chances! */
+			return true;
+		} else if (rcomp < 0) {
+			/* Found the right segment, further segments won't overlap */
+			rseg = seg;
+			break;
+		}
+	}
+	if (rseg == NULL || rseg->num_dead_tuples == 0)
+		return false;
+
+	/*
+	 * Quickly rule out by lower bound (should happen a lot); the upper
+	 * bound was already checked by the segment search.
+	 */
+	if (vac_cmp_itemptr((void*)itemptr, (void*)rseg->dead_tuples) < 0)
+		return false;
 
 	res = (ItemPointer) bsearch((void *) itemptr,
-								(void *) vacrelstats->dead_tuples,
-								vacrelstats->num_dead_tuples,
+								(void *) rseg->dead_tuples,
+								rseg->num_dead_tuples,
 								sizeof(ItemPointerData),
 								vac_cmp_itemptr);
 
@@ -2037,7 +2170,7 @@ lazy_tid_reaped(ItemPointer itemptr, void *state)
 /*
  * Comparator routines for use with qsort() and bsearch().
  */
-static int
+static inline int
 vac_cmp_itemptr(const void *left, const void *right)
 {
 	BlockNumber lblk,
-- 
2.6.6
