While doing the code coverage testing of v7 patch shared with - [1], I
found that there are few lines of code in _hash_next() that are
redundant and needs to be removed. I basically came to know this while
testing the scenario where a hash index scan starts when a split is in
progress. I have removed those lines and attached is the newer version
of patch.

[1] - 
https://www.postgresql.org/message-id/CAE9k0Pmn92Le0iOTU5b6087eRXzUnkq1PKcihxtokHJ9boXCBg%40mail.gmail.com

On Mon, May 8, 2017 at 6:55 PM, Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
> Hi,
>
> On Tue, Apr 4, 2017 at 3:56 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> On Sun, Apr 2, 2017 at 4:14 AM, Ashutosh Sharma <ashu.coe...@gmail.com> 
>> wrote:
>>>
>>> Please note that these patches needs to be applied on top of  [1].
>>>
>>
>> Few more review comments:
>>
>> 1.
>> - page = BufferGetPage(so->hashso_curbuf);
>> + blkno = so->currPos.currPage;
>> + if (so->hashso_bucket_buf == so->currPos.buf)
>> + {
>> + buf = so->currPos.buf;
>> + LockBuffer(buf, BUFFER_LOCK_SHARE);
>> + page = BufferGetPage(buf);
>> + }
>>
>> Here, you are assuming that only bucket page can be pinned, but I
>> think that assumption is not right.  When _hash_kill_items() is called
>> before moving to next page, there could be a pin on the overflow page.
>> You need some logic to check if the buffer is pinned, then just lock
>> it.  I think once you do that, it might not be convinient to release
>> the pin at the end of this function.
>
> Yes, there are few cases where we might have pin on overflow pages too
> and we need to handle such cases in _hash_kill_items. I have taken
> care of this in the attached v7 patch. Thanks.
>
>>
>> Add some comments on top of _hash_kill_items to explain the new
>> changes or say some thing like "See _bt_killitems for details"
>
> Added some more comments on the new changes.
>
>>
>> 2.
>> + /*
>> + * We save the LSN of the page as we read it, so that we know whether it
>> + * safe to apply LP_DEAD hints to the page later.  This allows us to drop
>> + * the pin for MVCC scans, which allows vacuum to avoid blocking.
>> + */
>> + so->currPos.lsn = PageGetLSN(page);
>> +
>>
>> The second part of above comment doesn't make sense because vacuum's
>> will still be blocked because we hold the pin on primary bucket page.
>
> That's right. It doesn't make sense because we won't allow vacuum to
> start. I have removed it.
>
>>
>> 3.
>> + {
>> + /*
>> + * No more matching tuples were found. return FALSE
>> + * indicating the same. Also, remember the prev and
>> + * next block number so that if fetching tuples using
>> + * cursor we remember the page from where to start the
>> + * scan.
>> + */
>> + so->currPos.prevPage = (opaque)->hasho_prevblkno;
>> + so->currPos.nextPage = (opaque)->hasho_nextblkno;
>>
>> You can't read opaque without having lock and by this time it has
>> already been released.
>
> I have corrected it. Please refer to the attached v7 patch.
>
> Also, I think if you want to save position for
>> cursor movement, then you need to save the position of last bucket
>> when scan completes the overflow chain, however as you have written it
>> will be always invalid block number. I think there is similar problem
>> with prevblock number.
>
> Did you mean last bucket or last page. If it is last page, then I am
> already storing it.
>
>>
>> 4.
>> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber 
>> offnum,
>> +   OffsetNumber maxoff, ScanDirection dir)
>> +{
>> + HashScanOpaque so = (HashScanOpaque) scan->opaque;
>> + IndexTuple      itup;
>> + int itemIndex;
>> +
>> + if (ScanDirectionIsForward(dir))
>> + {
>> + /* load items[] in ascending order */
>> + itemIndex = 0;
>> +
>> + /* new page, relocate starting position by binary search */
>> + offnum = _hash_binsearch(page, so->hashso_sk_hash);
>>
>> What is the need to find offset number when this function already has
>> that as an input parameter?
>
> It's not required. I have removed it.
>
>>
>> 5.
>> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber 
>> offnum,
>> +   OffsetNumber maxoff, ScanDirection dir)
>>
>> I think maxoff is not required to be passed a parameter to this
>> function as you need it only for forward scan. You can compute it
>> locally.
>
> It is required for both forward and backward scan. However, I am not
> passing it to _hash_load_qualified_items() now.
>
>>
>> 6.
>> +_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber 
>> offnum,
>> +   OffsetNumber maxoff, ScanDirection dir)
>> {
>> ..
>> + if (ScanDirectionIsForward(dir))
>> + {
>> ..
>> + while (offnum <= maxoff)
>> {
>> ..
>> + if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
>> + _hash_checkqual(scan, itup))
>> + {
>> + /* tuple is qualified, so remember it */
>> + _hash_saveitem(so, itemIndex, offnum, itup);
>> + itemIndex++;
>> + }
>> +
>> + offnum = OffsetNumberNext(offnum);
>> ..
>>
>> Why are you traversing the whole page even when there is no match?
>> There is a similar problem in backward scan. I think this is blunder.
>
> Fixed. Please check the attached
> '0001-Rewrite-hash-index-scans-to-work-a-page-at-a-timev7.patch'
>
>>
>> 7.
>> + if (so->currPos.buf == so->hashso_bucket_buf ||
>> + so->currPos.buf == so->hashso_split_bucket_buf)
>> + {
>> + so->currPos.prevPage = InvalidBlockNumber;
>> + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
>> + }
>> + else
>> + {
>> + so->currPos.prevPage = (opaque)->hasho_prevblkno;
>> + _hash_relbuf(rel, so->currPos.buf);
>> + }
>> +
>> + so->currPos.nextPage = (opaque)->hasho_nextblkno;
>>
>> What makes you think it is safe read opaque after releasing the lock?
>
> Nothing makes me think to read opaque after releasing lock. It's a
> mistake. I have corrected it. Please check attached v7 patch.
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com
From 3b4cad90446e29613b3574b64d9361774a0f7210 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Wed, 10 May 2017 12:25:57 +0530
Subject: [PATCH] Rewrite hash index scan to work page at a timev8

Patch by Ashutosh Sharma <ashu.coe...@gmail.com>
---
 src/backend/access/hash/README       |  25 +-
 src/backend/access/hash/hash.c       | 156 +++---------
 src/backend/access/hash/hashpage.c   |  10 +-
 src/backend/access/hash/hashsearch.c | 445 +++++++++++++++++++++++++++++++----
 src/backend/access/hash/hashutil.c   |  70 +++++-
 src/include/access/hash.h            |  50 +++-
 6 files changed, 562 insertions(+), 194 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index c8a0ec7..eef7d66 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -259,10 +259,11 @@ The reader algorithm is:
 -- then, per read request:
 	reacquire content lock on current page
 	step to next page if necessary (no chaining of content locks, but keep
-     the pin on the primary bucket throughout the scan; we also maintain
-     a pin on the page currently being scanned)
-	get tuple
-	release content lock
+     the pin on the primary bucket throughout the scan)
+    save all the matching tuples from current index page into an items array
+    release pin and content lock (but if it is primary bucket page retain
+    it's pin till the end of scan)
+    get tuple from an item array
 -- at scan shutdown:
 	release all pins still held
 
@@ -270,15 +271,13 @@ Holding the buffer pin on the primary bucket page for the whole scan prevents
 the reader's current-tuple pointer from being invalidated by splits or
 compactions.  (Of course, other buckets can still be split or compacted.)
 
-To keep concurrency reasonably good, we require readers to cope with
-concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the buffer content lock on
-page.  Since deletion is not possible while a reader holds the pin on bucket,
-and we assume that heap tuple TIDs are unique, this can be implemented by
-searching for the same heap tuple TID previously returned.  Insertion does
-not move index entries across pages, so the previously-returned index entry
-should always be on the same page, at the same or higher offset number,
-as it was before.
+To minimize lock/unlock traffic, hash index scan always searches entire hash
+page to identify all the matching items at once, copying their heap tuple IDs
+into backend-local storage. The heap tuple IDs are then processed while not
+holding any page lock within the index thereby, allowing concurrent insertion
+to happen on a same index page without any requirement of re-finding the current
+scan position for reader. We do continue to hold a pin on the bucket page, to
+protect against concurrent deletions and bucket split.
 
 To allow for scans during a bucket split, if at the start of the scan, the
 bucket is marked as bucket-being-populated, it scan all the tuples in that
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index df54638..a34b812 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -268,66 +268,22 @@ bool
 hashgettuple(IndexScanDesc scan, ScanDirection dir)
 {
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	Relation	rel = scan->indexRelation;
-	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
 	bool		res;
+	HashScanPosItem	*currItem;
 
 	/* Hash indexes are always lossy since we store only the hash code */
 	scan->xs_recheck = true;
 
 	/*
-	 * We hold pin but not lock on current buffer while outside the hash AM.
-	 * Reacquire the read lock here.
-	 */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-
-	/*
 	 * If we've already initialized this scan, we can just advance it in the
 	 * appropriate direction.  If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	current = &(so->hashso_curpos);
-	if (ItemPointerIsValid(current))
+	if (!HashScanPosIsValid(so->currPos))
+		res = _hash_first(scan, dir);
+	else
 	{
 		/*
-		 * An insertion into the current index page could have happened while
-		 * we didn't have read lock on it.  Re-find our position by looking
-		 * for the TID we previously returned.  (Because we hold a pin on the
-		 * primary bucket page, no deletions or splits could have occurred;
-		 * therefore we can expect that the TID still exists in the current
-		 * index page, at an offset >= where we were.)
-		 */
-		OffsetNumber maxoffnum;
-
-		buf = so->hashso_curbuf;
-		Assert(BufferIsValid(buf));
-		page = BufferGetPage(buf);
-
-		/*
-		 * We don't need test for old snapshot here as the current buffer is
-		 * pinned, so vacuum can't clean the page.
-		 */
-		maxoffnum = PageGetMaxOffsetNumber(page);
-		for (offnum = ItemPointerGetOffsetNumber(current);
-			 offnum <= maxoffnum;
-			 offnum = OffsetNumberNext(offnum))
-		{
-			IndexTuple	itup;
-
-			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-			if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid)))
-				break;
-		}
-		if (offnum > maxoffnum)
-			elog(ERROR, "failed to re-find scan position within index \"%s\"",
-				 RelationGetRelationName(rel));
-		ItemPointerSetOffsetNumber(current, offnum);
-
-		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
 		if (scan->kill_prior_tuple)
@@ -341,16 +297,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 			 * instead, we just forget any excess entries.
 			 */
 			if (so->killedItems == NULL)
-				so->killedItems = palloc(MaxIndexTuplesPerPage *
-										 sizeof(HashScanPosItem));
+				so->killedItems = (int *)
+					palloc(MaxIndexTuplesPerPage * sizeof(int));
 
 			if (so->numKilled < MaxIndexTuplesPerPage)
-			{
-				so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
-				so->killedItems[so->numKilled].indexOffset =
-							ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-				so->numKilled++;
-			}
+				so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 		}
 
 		/*
@@ -358,30 +309,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		 */
 		res = _hash_next(scan, dir);
 	}
-	else
-		res = _hash_first(scan, dir);
-
-	/*
-	 * Skip killed tuples if asked to.
-	 */
-	if (scan->ignore_killed_tuples)
-	{
-		while (res)
-		{
-			offnum = ItemPointerGetOffsetNumber(current);
-			page = BufferGetPage(so->hashso_curbuf);
-			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
-				break;
-			res = _hash_next(scan, dir);
-		}
-	}
-
-	/* Release read lock on current buffer, but keep it pinned */
-	if (BufferIsValid(so->hashso_curbuf))
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
 
 	/* Return current heap TID on success */
-	scan->xs_ctup.t_self = so->hashso_heappos;
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return res;
 }
@@ -396,35 +327,22 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	bool		res;
 	int64		ntids = 0;
+	HashScanPosItem *currItem;
 
 	res = _hash_first(scan, ForwardScanDirection);
 
 	while (res)
 	{
-		bool		add_tuple;
+		currItem = &so->currPos.items[so->currPos.itemIndex];
 
 		/*
-		 * Skip killed tuples if asked to.
+		 * _hash_first() or _hash_next() never returns
+		 * dead tuples. Therefore, we can always add
+		 * the tuples into TIDBitmap without checking
+		 * if a tuple is dead or not.
 		 */
-		if (scan->ignore_killed_tuples)
-		{
-			Page		page;
-			OffsetNumber offnum;
-
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
-			add_tuple = !ItemIdIsDead(PageGetItemId(page, offnum));
-		}
-		else
-			add_tuple = true;
-
-		/* Save tuple ID, and continue scanning */
-		if (add_tuple)
-		{
-			/* Note we mark the tuple ID as requiring recheck */
-			tbm_add_tuples(tbm, &(so->hashso_heappos), 1, true);
-			ntids++;
-		}
+		tbm_add_tuples(tbm, &(currItem->heapTid), 1, true);
+		ntids++;
 
 		res = _hash_next(scan, ForwardScanDirection);
 	}
@@ -448,12 +366,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 	scan = RelationGetIndexScan(rel, nkeys, norderbys);
 
 	so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
-	so->hashso_curbuf = InvalidBuffer;
+	HashScanPosInvalidate(so->currPos);
 	so->hashso_bucket_buf = InvalidBuffer;
 	so->hashso_split_bucket_buf = InvalidBuffer;
-	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
 
 	so->hashso_buc_populated = false;
 	so->hashso_buc_split = false;
@@ -476,23 +391,16 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items.
-	 * Also, ensure that we acquire lock on current page before
-	 * calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan, false);
+		_hash_dropscanbuf(rel, so);
 	}
 
-	_hash_dropscanbuf(rel, so);
-
 	/* set position invalid (this will cause _hash_first call) */
-	ItemPointerSetInvalid(&(so->hashso_curpos));
-	ItemPointerSetInvalid(&(so->hashso_heappos));
+	HashScanPosInvalidate(so->currPos);
 
 	/* Update scan key, if a new one is given */
 	if (scankey && scan->numberOfKeys > 0)
@@ -515,20 +423,14 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/*
-	 * Before leaving current page, deal with any killed items.
-	 * Also, ensure that we acquire lock on current page before
-	 * calling _hash_kill_items.
-	 */
-	if (so->numKilled > 0)
+	if (HashScanPosIsValid(so->currPos))
 	{
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
-		_hash_kill_items(scan);
-		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+		/* Before leaving current page, deal with any killed items */
+		if (so->numKilled > 0)
+			_hash_kill_items(scan, false);
+		_hash_dropscanbuf(rel, so);
 	}
 
-	_hash_dropscanbuf(rel, so);
-
 	if (so->killedItems != NULL)
 		pfree(so->killedItems);
 	pfree(so);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index bf1ffff..78141e2 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -298,20 +298,20 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 {
 	/* release pin we hold on primary bucket page */
 	if (BufferIsValid(so->hashso_bucket_buf) &&
-		so->hashso_bucket_buf != so->hashso_curbuf)
+		so->hashso_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_bucket_buf);
 	so->hashso_bucket_buf = InvalidBuffer;
 
 	/* release pin we hold on primary bucket page  of bucket being split */
 	if (BufferIsValid(so->hashso_split_bucket_buf) &&
-		so->hashso_split_bucket_buf != so->hashso_curbuf)
+		so->hashso_split_bucket_buf != so->currPos.buf)
 		_hash_dropbuf(rel, so->hashso_split_bucket_buf);
 	so->hashso_split_bucket_buf = InvalidBuffer;
 
 	/* release any pin we still hold */
-	if (BufferIsValid(so->hashso_curbuf))
-		_hash_dropbuf(rel, so->hashso_curbuf);
-	so->hashso_curbuf = InvalidBuffer;
+	if (BufferIsValid(so->currPos.buf))
+		_hash_dropbuf(rel, so->currPos.buf);
+	so->currPos.buf = InvalidBuffer;
 
 	/* reset split scan */
 	so->hashso_buc_populated = false;
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 2d92049..13b1d8d 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,44 +20,108 @@
 #include "pgstat.h"
 #include "utils/rel.h"
 
+static bool _hash_readpage(IndexScanDesc scan, Buffer *bufP,
+			ScanDirection dir);
+static int _hash_load_qualified_items(IndexScanDesc scan, Page page,
+			OffsetNumber offnum, ScanDirection dir);
+static inline void _hash_saveitem(HashScanOpaque so, int itemIndex,
+			OffsetNumber offnum, IndexTuple itup);
+static void _hash_readnext(IndexScanDesc scan, Buffer *bufp,
+			Page *pagep, HashPageOpaque *opaquep);
 
 /*
  *	_hash_next() -- Get the next item in a scan.
  *
- *		On entry, we have a valid hashso_curpos in the scan, and a
- *		pin and read lock on the page that contains that item.
- *		We find the next item in the scan, if any.
- *		On success exit, we have the page containing the next item
- *		pinned and locked.
+ *		On entry, so->currPos describes the current page, which may
+ *		be pinned but not locked, and so->currPos.itemIndex identifies
+ *		which item was previously returned.
+ *
+ *		On successful exit, scan->xs_ctup.t_self is set to the TID
+ *		of the next heap tuple, and if requested, scan->xs_itup
+ *		points to a copy of the index tuple.  so->currPos is updated
+ *		as needed.
+ *
+ *		On failure exit (no more tuples), we return FALSE with no
+ *		pins or locks held.
  */
 bool
 _hash_next(IndexScanDesc scan, ScanDirection dir)
 {
 	Relation	rel = scan->indexRelation;
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	HashScanPosItem *currItem;
+	BlockNumber blkno;
 	Buffer		buf;
-	Page		page;
-	OffsetNumber offnum;
-	ItemPointer current;
-	IndexTuple	itup;
-
-	/* we still have the buffer pinned and read-locked */
-	buf = so->hashso_curbuf;
-	Assert(BufferIsValid(buf));
+	bool		end_of_scan = false;
 
 	/*
-	 * step to next valid tuple.
+	 * Advance to next tuple on current page; or if there's no more,
+	 * try to read data from next or prev page based on the scan
+	 * direction. Before moving to the next or prev page make sure
+	 * that we deal with all the killed items.
 	 */
-	if (!_hash_step(scan, &buf, dir))
+	if (ScanDirectionIsForward(dir))
+	{
+		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, false);
+
+			blkno = so->currPos.nextPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+	else
+	{
+		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		{
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, false);
+
+			blkno = so->currPos.prevPage;
+			if (BlockNumberIsValid(blkno))
+			{
+				buf = _hash_getbuf(rel, blkno, HASH_READ,
+								   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+				so->currPos.buf = buf;
+				TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));
+
+				/*
+				 * We always maintain the pin on bucket page for whole scan
+				 * operation, so releasing the additional pin we have acquired
+				 * here.
+				 */
+				if (buf == so->hashso_bucket_buf ||
+					buf == so->hashso_split_bucket_buf)
+					_hash_dropbuf(rel, buf);
+
+				if (!_hash_readpage(scan, &buf, dir))
+					end_of_scan = true;
+			}
+			else
+				end_of_scan = true;
+		}
+	}
+
+	if (end_of_scan)
+	{
+		_hash_dropscanbuf(rel, so);
+		HashScanPosInvalidate(so->currPos);
 		return false;
+	}
 
-	/* if we're here, _hash_step found a valid tuple */
-	current = &(so->hashso_curpos);
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* OK, itemIndex says what to return */
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	scan->xs_ctup.t_self = currItem->heapTid;
 
 	return true;
 }
@@ -212,11 +276,15 @@ _hash_readprev(IndexScanDesc scan,
 /*
  *	_hash_first() -- Find the first item in a scan.
  *
- *		Find the first item in the index that
- *		satisfies the qualification associated with the scan descriptor. On
- *		success, the page containing the current index tuple is read locked
- *		and pinned, and the scan's opaque data entry is updated to
- *		include the buffer.
+ *		We find the first item (or, if backward scan, the last item) in
+ *		the index that satisfies the qualification associated with the
+ *		scan descriptor. On success, the page containing the current
+ *		index tuple is read locked and pinned, and data about the
+ *		matching tuple(s) on the page has been loaded into so->currPos,
+ *		scan->xs_ctup.t_self is set to the heap TID of the current tuple.
+ *
+ *		If there are no matching items in the index, we return FALSE,
+ *		with no pins or locks held.
  */
 bool
 _hash_first(IndexScanDesc scan, ScanDirection dir)
@@ -229,15 +297,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	Buffer		buf;
 	Page		page;
 	HashPageOpaque opaque;
-	IndexTuple	itup;
-	ItemPointer current;
-	OffsetNumber offnum;
 
 	pgstat_count_index_scan(rel);
 
-	current = &(so->hashso_curpos);
-	ItemPointerSetInvalid(current);
-
 	/*
 	 * We do not support hash scans with no index qualification, because we
 	 * would have to read the whole index rather than just one bucket. That
@@ -356,17 +418,15 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 			_hash_readnext(scan, &buf, &page, &opaque);
 	}
 
-	/* Now find the first tuple satisfying the qualification */
-	if (!_hash_step(scan, &buf, dir))
-		return false;
+	/* remember which buffer we have pinned, if any */
+	Assert(BufferIsInvalid(so->currPos.buf));
+	so->currPos.buf = buf;
 
-	/* if we're here, _hash_step found a valid tuple */
-	offnum = ItemPointerGetOffsetNumber(current);
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-	so->hashso_heappos = itup->t_tid;
+	/* Now find all the tuples satisfying the qualification from a page */
+	if (!_hash_readpage(scan, &buf, dir))
+		return false;
 
+	/* if we're here, _hash_readpage found a valid tuples */
 	return true;
 }
 
@@ -575,3 +635,304 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 	ItemPointerSet(current, blkno, offnum);
 	return true;
 }
+
+/*
+ *	_hash_readpage() -- Load data from current index page into so->currPos
+ *
+ *	We scan all the items in the current index page and save them into
+ *	so->currPos if it satifies the qualification. If no matching items
+ *	are found in the current page, we move to the next or previous page
+ *	in a bucket chain as indicated by the direction.
+ *
+ *	Return true if any matching items are found else return false.
+ */
+static bool
+_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
+{
+	Relation	rel = scan->indexRelation;
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	Buffer		buf;
+	Page		page;
+	HashPageOpaque	opaque;
+	OffsetNumber	offnum;
+	uint16			itemIndex;
+
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	buf = *bufP;
+	Assert(BufferIsValid(buf));
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * We save the LSN of the page as we read it, so that we
+	 * know whether it safe to apply LP_DEAD hints to the
+	 * page later.
+	 */
+	so->currPos.lsn = PageGetLSN(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+		while (itemIndex == 0)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the next page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, true);
+
+			/*
+			 * We remember prev and next block number along with
+			 * current block number so that if fetching the tup-
+			 * les using cursor we know the page from where to
+			 * startwith. This is for the case where we have re-
+			 * ached the end of bucket chain without finding any
+			 * matching tuples. See comments in else part below.
+			 */
+			if (!BlockNumberIsValid((opaque)->hasho_nextblkno))
+			{
+				so->currPos.prevPage = (opaque)->hasho_prevblkno;
+				so->currPos.nextPage = InvalidBlockNumber;
+			}
+
+			_hash_readnext(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				offnum = _hash_binsearch(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page,
+													   offnum, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found. return FALSE
+				 * indicating the same.
+				 */
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+			so->currPos.buf = InvalidBuffer;
+		}
+
+		so->currPos.firstItem = 0;
+		so->currPos.lastItem = itemIndex - 1;
+		so->currPos.itemIndex = 0;
+	}
+	else
+	{
+		/* new page, locate starting position by binary search */
+		offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+
+		itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
+
+		while (itemIndex == MaxIndexTuplesPerPage)
+		{
+			/*
+			 * Could not find any matching tuples in the current page, move
+			 * to the prev page. Before leaving the current page, also deal
+			 * with any killed items.
+			 */
+			if (so->numKilled > 0)
+				_hash_kill_items(scan, true);
+
+			/*
+			 * We remember prev and next block number along with
+			 * current block number so that if fetching the tup-
+			 * les using cursor we know the page from where to
+			 * startwith. This is for the case where we have re-
+			 * ached the bucket page without finding any matching
+			 * tuples. See comments in else part below.
+			 */
+			if (so->currPos.buf == so->hashso_bucket_buf ||
+				so->currPos.buf == so->hashso_split_bucket_buf)
+			{
+				so->currPos.prevPage = InvalidBlockNumber;
+				so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			}
+
+			_hash_readprev(scan, &buf, &page, &opaque);
+			if (BufferIsValid(buf))
+			{
+				so->currPos.buf = buf;
+				so->currPos.currPage = BufferGetBlockNumber(buf);
+				so->currPos.lsn = PageGetLSN(page);
+				offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
+				itemIndex = _hash_load_qualified_items(scan, page,
+													   offnum, dir);
+			}
+			else
+			{
+				/*
+				 * No more matching tuples were found. return FALSE
+				 * indicating the same.
+				 */
+				so->currPos.buf = buf;
+				return false;
+			}
+		}
+
+		if (so->currPos.buf == so->hashso_bucket_buf ||
+			so->currPos.buf == so->hashso_split_bucket_buf)
+		{
+			so->currPos.prevPage = InvalidBlockNumber;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		}
+		else
+		{
+			so->currPos.prevPage = (opaque)->hasho_prevblkno;
+			so->currPos.nextPage = (opaque)->hasho_nextblkno;
+			_hash_relbuf(rel, so->currPos.buf);
+			so->currPos.buf = InvalidBuffer;
+		}
+
+		so->currPos.firstItem = itemIndex;
+		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+	}
+
+	return (so->currPos.firstItem <= so->currPos.lastItem);
+}
+
+/*
+ * Load all the qualified items from a current index page
+ * into so->currPos. Helper function for _hash_readpage.
+ */
+static int
+_hash_load_qualified_items(IndexScanDesc scan, Page page,
+						   OffsetNumber offnum, ScanDirection dir)
+{
+	HashScanOpaque so = (HashScanOpaque) scan->opaque;
+	IndexTuple      itup;
+	int				itemIndex;
+	OffsetNumber	maxoff;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (ScanDirectionIsForward(dir))
+	{
+		/* load items[] in ascending order */
+		itemIndex = 0;
+
+		while (offnum <= maxoff)
+		{
+			Assert(offnum >= FirstOffsetNumber);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberNext(offnum);	/* move forward */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+				itemIndex++;
+			}
+			else
+				/*
+				 * No more matching tuples exist in this page. so, exit
+				 * while loop.
+				 */
+				break;
+
+			offnum = OffsetNumberNext(offnum);
+		}
+
+		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		return itemIndex;
+	}
+	else
+	{
+		/* load items[] in descending order */
+		itemIndex = MaxIndexTuplesPerPage;
+
+		while (offnum >= FirstOffsetNumber)
+		{
+			Assert(offnum <= maxoff);
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+			/*
+			 * skip the tuples that are moved by split operation
+			 * for the scan that has started when split was in
+			 * progress. Also, skip the tuples that are marked
+			 * as dead.
+			 */
+			if ((so->hashso_buc_populated && !so->hashso_buc_split &&
+				(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
+				(scan->ignore_killed_tuples &&
+				(ItemIdIsDead(PageGetItemId(page, offnum)))))
+			{
+				offnum = OffsetNumberPrev(offnum);	/* move back */
+				continue;
+			}
+
+			if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+				_hash_checkqual(scan, itup))
+			{
+				itemIndex--;
+				/* tuple is qualified, so remember it */
+				_hash_saveitem(so, itemIndex, offnum, itup);
+			}
+			else
+				/*
+				 * No more matching tuples exist in this page. so, exit
+				 * while loop.
+				 */
+				break;
+
+			offnum = OffsetNumberPrev(offnum);
+		}
+
+		Assert(itemIndex >= 0);
+		return itemIndex;
+	}
+}
+
+/* Save an index item into so->currPos.items[itemIndex] */
+static inline void
+_hash_saveitem(HashScanOpaque so, int itemIndex,
+			   OffsetNumber offnum, IndexTuple itup)
+{
+	HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = itup->t_tid;
+	currItem->indexOffset = offnum;
+}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 9f832f2..e5c3b1e 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -522,13 +522,28 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
  * current page and killed tuples thereon (generally, this should only be
  * called if so->numKilled > 0).
  *
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer.  Note that read-lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
+ *
+ * The caller must have pin on bucket buffer, but may or may not have pin
+ * on overflow buffer, as indicated by havePin.
+ *
  * We match items by heap TID before assuming they are the right ones to
  * delete.
+ *
+ * Note that we keep pin on the bucket page throughout the scan. Hence,
+ * there is no chance of VACUUM deleting any items from the page.
+ *
+ * See _bt_killitems() for more details.
  */
 void
-_hash_kill_items(IndexScanDesc scan)
+_hash_kill_items(IndexScanDesc scan, bool havePin)
 {
 	HashScanOpaque	so = (HashScanOpaque) scan->opaque;
+	Relation    rel = scan->indexRelation;
+	BlockNumber blkno;
+	Buffer  buf;
 	Page	page;
 	HashPageOpaque	opaque;
 	OffsetNumber	offnum, maxoff;
@@ -538,6 +553,7 @@ _hash_kill_items(IndexScanDesc scan)
 
 	Assert(so->numKilled > 0);
 	Assert(so->killedItems != NULL);
+	Assert(HashScanPosIsValid(so->currPos));
 
 	/*
 	 * Always reset the scan state, so we don't look for same
@@ -545,20 +561,58 @@ _hash_kill_items(IndexScanDesc scan)
 	 */
 	so->numKilled = 0;
 
-	page = BufferGetPage(so->hashso_curbuf);
+	blkno = so->currPos.currPage;
+	if (so->hashso_bucket_buf == so->currPos.buf)
+	{
+		buf = so->currPos.buf;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buf);
+	}
+	else
+	{
+		if (!havePin)
+			buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
+		else
+		{
+			buf = so->currPos.buf;
+			LockBuffer(buf, BUFFER_LOCK_SHARE);
+		}
+
+		/* It might not exist anymore; in which case we can't hint it. */
+		if (!BufferIsValid(buf))
+			return;
+
+		/*
+		 * If page LSN differs it means that the page was modified since the
+		 * last read. killedItems could be not valid so LP_DEAD hints apply-
+		 * ing is not safe.
+		 */
+		page = BufferGetPage(buf);
+		if (PageGetLSN(page) != so->currPos.lsn)
+		{
+			_hash_relbuf(rel, buf);
+			return;
+		}
+	}
+
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (i = 0; i < numKilled; i++)
 	{
-		offnum = so->killedItems[i].indexOffset;
+		int				itemIndex = so->killedItems[i];
+		HashScanPosItem *currItem = &so->currPos.items[itemIndex];
+		offnum = currItem->indexOffset;
+
+		Assert(itemIndex >= so->currPos.firstItem &&
+			   itemIndex <= so->currPos.lastItem);
 
 		while (offnum <= maxoff)
 		{
 			ItemId	iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
 
-			if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
 				/* found the item */
 				ItemIdMarkDead(iid);
@@ -577,6 +631,12 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(so->hashso_curbuf, true);
+		MarkBufferDirtyHint(buf, true);
 	}
+
+	if (so->hashso_bucket_buf == so->currPos.buf ||
+		havePin)
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
 }
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index adba224..ab9ebda 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -103,6 +103,46 @@ typedef struct HashScanPosItem    /* what we remember about each match */
 	OffsetNumber indexOffset;	/* index item's location within page */
 } HashScanPosItem;
 
+typedef struct HashScanPosData
+{
+	Buffer		buf;			/* if valid, the buffer is pinned */
+	XLogRecPtr  lsn;			/* pos in the WAL stream when page was read */
+	BlockNumber currPage;		/* current hash index page */
+	BlockNumber nextPage;		/* next overflow page */
+	BlockNumber prevPage;		/* prev overflow or bucket page */
+
+	/*
+	 * The items array is always ordered in index order (ie, increasing
+	 * indexoffset).  When scanning backwards it is convenient to fill the
+	 * array back-to-front, so we start at the last slot and fill downwards.
+	 * Hence we need both a first-valid-entry and a last-valid-entry counter.
+	 * itemIndex is a cursor showing which entry was last returned to caller.
+	 */
+	int			firstItem;		/* first valid index in items[] */
+	int			lastItem;		/* last valid index in items[] */
+	int			itemIndex;		/* current index in items[] */
+
+	HashScanPosItem items[MaxIndexTuplesPerPage];		/* MUST BE LAST */
+}	HashScanPosData;
+
+#define HashScanPosIsValid(scanpos) \
+( \
+	AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+				!BufferIsValid((scanpos).buf)), \
+	BlockNumberIsValid((scanpos).currPage) \
+)
+
+#define HashScanPosInvalidate(scanpos) \
+	do { \
+		(scanpos).buf = InvalidBuffer; \
+		(scanpos).lsn = InvalidXLogRecPtr; \
+		(scanpos).currPage = InvalidBlockNumber; \
+		(scanpos).nextPage = InvalidBlockNumber; \
+		(scanpos).prevPage = InvalidBlockNumber; \
+		(scanpos).firstItem = 0; \
+		(scanpos).lastItem = 0; \
+		(scanpos).itemIndex = 0; \
+	} while (0);
 
 /*
  *	HashScanOpaqueData is private state for a hash index scan.
@@ -145,8 +185,14 @@ typedef struct HashScanOpaqueData
 	 */
 	bool		hashso_buc_split;
 	/* info about killed items if any (killedItems is NULL if never used) */
-	HashScanPosItem	*killedItems;	/* tids and offset numbers of killed items */
+	int			*killedItems;		/* currPos.items indexes of killed items */
 	int			numKilled;			/* number of currently stored items */
+
+	/*
+	 * Identify all the matching items on a page and save them
+	 * in HashScanPosData
+	 */
+	HashScanPosData	currPos;		/* current position data */
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -411,7 +457,7 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
 extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
 extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
 								   uint32 lowmask, uint32 maxbucket);
-extern void _hash_kill_items(IndexScanDesc scan);
+extern void _hash_kill_items(IndexScanDesc scan, bool havePin);
 
 /* hash.c */
 extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
-- 
1.8.3.1

From 1ec664ac796b5b631fc772849c6de1aa1737f309 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sun, 12 Feb 2017 10:52:22 +0530
Subject: [PATCH] Remove redundant function _hash_step() and some of the unused
 members of HashScanOpaqueData. The function _hash_step() used to find the
 next qualifing tuple in the index page is no more required as new hash index
 scan works page at a time which means it reads all the qualifing tuples in a
 page at once with the help of a new function _hash_readpage().

Patch by Ashutosh Sharma.
---
 src/backend/access/hash/hashsearch.c | 206 -----------------------------------
 src/include/access/hash.h            |  15 ---
 2 files changed, 221 deletions(-)

diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 96da9b5..913a996 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -410,212 +410,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 }
 
 /*
- *	_hash_step() -- step to the next valid item in a scan in the bucket.
- *
- *		If no valid record exists in the requested direction, return
- *		false.  Else, return true and set the hashso_curpos for the
- *		scan to the right thing.
- *
- *		Here we need to ensure that if the scan has started during split, then
- *		skip the tuples that are moved by split while scanning bucket being
- *		populated and then scan the bucket being split to cover all such
- *		tuples.  This is done to ensure that we don't miss tuples in the scans
- *		that are started during split.
- *
- *		'bufP' points to the current buffer, which is pinned and read-locked.
- *		On success exit, we have pin and read-lock on whichever page
- *		contains the right item; on failure, we have released all buffers.
- */
-bool
-_hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
-{
-	Relation	rel = scan->indexRelation;
-	HashScanOpaque so = (HashScanOpaque) scan->opaque;
-	ItemPointer current;
-	Buffer		buf;
-	Page		page;
-	HashPageOpaque opaque;
-	OffsetNumber maxoff;
-	OffsetNumber offnum;
-	BlockNumber blkno;
-	IndexTuple	itup;
-
-	current = &(so->hashso_curpos);
-
-	buf = *bufP;
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-	page = BufferGetPage(buf);
-	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
-
-	/*
-	 * If _hash_step is called from _hash_first, current will not be valid, so
-	 * we can't dereference it.  However, in that case, we presumably want to
-	 * start at the beginning/end of the page...
-	 */
-	maxoff = PageGetMaxOffsetNumber(page);
-	if (ItemPointerIsValid(current))
-		offnum = ItemPointerGetOffsetNumber(current);
-	else
-		offnum = InvalidOffsetNumber;
-
-	/*
-	 * 'offnum' now points to the last tuple we examined (if any).
-	 *
-	 * continue to step through tuples until: 1) we get to the end of the
-	 * bucket chain or 2) we find a valid tuple.
-	 */
-	do
-	{
-		switch (dir)
-		{
-			case ForwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberNext(offnum);	/* move forward */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum <= maxoff)
-					{
-						Assert(offnum >= FirstOffsetNumber);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberNext(offnum);	/* move forward */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readnext(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			case BackwardScanDirection:
-				if (offnum != InvalidOffsetNumber)
-					offnum = OffsetNumberPrev(offnum);	/* move back */
-				else
-				{
-					/* new page, locate starting position by binary search */
-					offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-				}
-
-				for (;;)
-				{
-					/*
-					 * check if we're still in the range of items with the
-					 * target hash key
-					 */
-					if (offnum >= FirstOffsetNumber)
-					{
-						Assert(offnum <= maxoff);
-						itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-
-						/*
-						 * skip the tuples that are moved by split operation
-						 * for the scan that has started when split was in
-						 * progress
-						 */
-						if (so->hashso_buc_populated && !so->hashso_buc_split &&
-							(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
-						{
-							offnum = OffsetNumberPrev(offnum);	/* move back */
-							continue;
-						}
-
-						if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
-							break;		/* yes, so exit for-loop */
-					}
-
-					/* Before leaving current page, deal with any killed items */
-					if (so->numKilled > 0)
-						_hash_kill_items(scan);
-
-					/*
-					 * ran off the end of this page, try the next
-					 */
-					_hash_readprev(scan, &buf, &page, &opaque);
-					if (BufferIsValid(buf))
-					{
-						TestForOldSnapshot(scan->xs_snapshot, rel, page);
-						maxoff = PageGetMaxOffsetNumber(page);
-						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
-					}
-					else
-					{
-						itup = NULL;
-						break;	/* exit for-loop */
-					}
-				}
-				break;
-
-			default:
-				/* NoMovementScanDirection */
-				/* this should not be reached */
-				itup = NULL;
-				break;
-		}
-
-		if (itup == NULL)
-		{
-			/*
-			 * We ran off the end of the bucket without finding a match.
-			 * Release the pin on bucket buffers.  Normally, such pins are
-			 * released at end of scan, however scrolling cursors can
-			 * reacquire the bucket lock and pin in the same scan multiple
-			 * times.
-			 */
-			*bufP = so->hashso_curbuf = InvalidBuffer;
-			ItemPointerSetInvalid(current);
-			_hash_dropscanbuf(rel, so);
-			return false;
-		}
-
-		/* check the tuple quals, loop around if not met */
-	} while (!_hash_checkqual(scan, itup));
-
-	/* if we made it to here, we've found a valid tuple */
-	blkno = BufferGetBlockNumber(buf);
-	*bufP = so->hashso_curbuf = buf;
-	ItemPointerSet(current, blkno, offnum);
-	return true;
-}
-
-/*
  *	_hash_readpage() -- Load data from current index page into so->currPos
  *
  *	We scan all the items in the current index page and save them into
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 4efed52..4056da5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -150,14 +150,6 @@ typedef struct HashScanOpaqueData
 	/* Hash value of the scan key, ie, the hash key we seek */
 	uint32		hashso_sk_hash;
 
-	/*
-	 * We also want to remember which buffer we're currently examining in the
-	 * scan. We keep the buffer pinned (but not locked) across hashgettuple
-	 * calls, in order to avoid doing a ReadBuffer() for every tuple in the
-	 * index.
-	 */
-	Buffer		hashso_curbuf;
-
 	/* remember the buffer associated with primary bucket */
 	Buffer		hashso_bucket_buf;
 
@@ -168,12 +160,6 @@ typedef struct HashScanOpaqueData
 	 */
 	Buffer		hashso_split_bucket_buf;
 
-	/* Current position of the scan, as an index TID */
-	ItemPointerData hashso_curpos;
-
-	/* Current position of the scan, as a heap TID */
-	ItemPointerData hashso_heappos;
-
 	/* Whether scan starts on bucket being populated due to split */
 	bool		hashso_buc_populated;
 
@@ -409,7 +395,6 @@ extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 /* hashsearch.c */
 extern bool _hash_next(IndexScanDesc scan, ScanDirection dir);
 extern bool _hash_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
-- 
1.8.3.1

From 21894a693904a6ec270906fee403880768ef3db5 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Sun, 2 Apr 2017 03:43:20 +0530
Subject: [PATCH] Improve locking startegy during VACUUM in Hash Index v3

---
 src/backend/access/hash/README     |  2 +-
 src/backend/access/hash/hash.c     | 21 ++++++++++-----------
 src/backend/access/hash/hashovfl.c |  4 +---
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 063656d..a3f2445 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -380,8 +380,8 @@ The fourth operation is garbage collection (bulk deletion):
 			mark the target page dirty
 			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
-			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
+			pin and x-lock next page
 		if the page we have locked is not the primary bucket page:
 			release lock and take exclusive lock on primary bucket page
 		if there are no other pins on the primary bucket page:
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 582163a..4c82868 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -670,11 +670,9 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * that the next valid TID will be greater than or equal to the current
  * valid TID.  There can't be any concurrent scans in progress when we first
  * enter this function because of the cleanup lock we hold on the primary
- * bucket page, but as soon as we release that lock, there might be.  We
- * handle that by conspiring to prevent those scans from passing our cleanup
- * scan.  To do that, we lock the next page in the bucket chain before
- * releasing the lock on the previous page.  (This type of lock chaining is
- * not ideal, so we might want to look for a better solution at some point.)
+ * bucket page, but as soon as we release that lock, there might be. But,
+ * we do not have to bother about it, as the hash index scan work in page
+ * at a time mode.
  *
  * We need to retain a pin on the primary bucket to ensure that no concurrent
  * split can start.
@@ -843,19 +841,20 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		if (!BlockNumberIsValid(blkno))
 			break;
 
-		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-											  LH_OVERFLOW_PAGE,
-											  bstrategy);
-
 		/*
-		 * release the lock on previous page after acquiring the lock on next
-		 * page
+		 * As the hash index scan work in page at a time mode,
+		 * vacuum can release the lock on previous page before
+		 * acquiring lock on the next page.
 		 */
 		if (retain_pin)
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, buf);
 
+		next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+											  LH_OVERFLOW_PAGE,
+											  bstrategy);
+
 		buf = next_buf;
 	}
 
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index a3cae21..dc119a3 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -778,9 +778,7 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
  *	Caller must acquire cleanup lock on the primary page of the target
  *	bucket to exclude any scans that are in progress, which could easily
  *	be confused into returning the same tuple more than once or some tuples
- *	not at all by the rearrangement we are performing here.  To prevent
- *	any concurrent scan to cross the squeeze scan we use lock chaining
- *	similar to hasbucketcleanup.  Refer comments atop hashbucketcleanup.
+ *	not at all by the rearrangement we are performing here.
  *
  *	We need to retain a pin on the primary bucket to ensure that no concurrent
  *	split can start.
-- 
1.8.3.1

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to