On 30.06.2011 07:50, Jeff Janes wrote:
My concern is that I am unable to prove to myself simply by reading
the code that the 24 line chunk deleted from gistFindPath (near ***
919,947 ****) are no longer needed.  My familiarity with the gist code
is low enough that it is not surprising that I cannot prove this to
myself from first principles.  I have no reason to believe it is not
correct, it is just that I can't convince myself that it is correct.

This is the piece of code we're talking about:

***************
*** 919,947 **** gistFindPath(Relation r, BlockNumber child)
                        blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
                        if (blkno == child)
                        {
-                               OffsetNumber poff = InvalidOffsetNumber;
-
-                               /* make childs links */
-                               ptr = top;
-                               while (ptr->parent)
-                               {
-                                       /* move childoffnum.. */
-                                       if (ptr == top)
-                                       {
-                                               /* first iteration */
-                                               poff = ptr->parent->childoffnum;
-                                               ptr->parent->childoffnum = 
ptr->childoffnum;
-                                       }
-                                       else
-                                       {
-                                               OffsetNumber tmp = 
ptr->parent->childoffnum;
-
-                                               ptr->parent->childoffnum = poff;
-                                               poff = tmp;
-                                       }
-                                       ptr = ptr->parent;
-                               }
-                               top->childoffnum = i;
                                UnlockReleaseBuffer(buffer);
                                return top;
                        }

Now that I look closer at the patch, I think it's in fact incorrect. The removed code used to store the offset of the downlink in the direct parent of the child that was searched for, in top->childoffnum. That's the last removed line: top->childoffnum = i. With the patch, that is stored nowhere. gistFindPath() needs to return it somehow, so that it gets updated in the stack returned by gistFindCorrectParent.

Attached is a modified patch that fixes that. I couldn't resist some cosmetic changes along the way, sorry about that. I made gistFindPath use a regular List instead of carrying the extra 'next' field in GISTInsertStack. That seems much cleaner as that field is only needed for local storage in the highly unlikely case that gistFindPath() is called. I also made the error cases use elog() instead of assertions.

So I tried provoking situations where this surrounding code section
would get executed, both patched and unpatched.  I have been unable to
do so--apparently this code is for an incredibly obscure situation
which I can't induce at will.

You'll need a concurrent split of the root page, while you're splitting a page at some lower level. For example:

    R
L1     L2

R is the root page, and L1 and L2 are leaf pages. Now, imagine that you insert a tuple into L2, causing it to split into pages L2* and L3. Your insertion stack looks like R->L2. Before you have a chance to insert the downlink for L3 into R, someone else splits the root page:

      R
  I1         I2
L1 L3      L2* L3

The new parent of L2 is the new internal page I2, but gistFindCorrectParent() will never visit that page. The insertion stack is R->L2, so gistFindCorrectParent() will only search R, and won't find the downlink for L2 there anymore.

The only practical way to test that is to use a debugger or add some debugging statements to the code. Here's what I did:

1. Create a test table and gist index:

CREATE TABLE gisttest (p point);
CREATE INDEX i_gisttest ON gisttest USING gist (p)

2. Insert some test data. Use two different values so that you can conveniently later insert into distinct branches of the gist tree.

INSERT INTO gisttest SELECT point(1,1) FROM generate_series(1,1000); INSERT INTO gisttest SELECT point(10,10) FROM generate_series(1,1000);

3. Attach a debugger to the backend process, and create a couple of breakpoints:

(gdb) break gistSplit
Breakpoint 1 at 0x46ace1: file gist.c, line 1295.
(gdb) break gistFindPath
Breakpoint 2 at 0x469740: file gist.c, line 884.
(gdb) cont
Continuing.

4. Insert some more tuples to the table using the debugged backend:

INSERT INTO gisttest SELECT point(1,1) FROM generate_series(1,1000);

This hits the breakpoint at gistSplit:

Breakpoint 1, gistSplit (r=0x7f84f4bad328, page=0x7f84f1b8b180 "",
    itup=0x2032118, len=186, giststate=0x7fff8615e9e0) at gist.c:1295
1295            SplitedPageLayout *res = NULL;

5. The backend is now in the middle of splitting a leaf page. Now we need to make its GISTInsertStack obsolete by concurrently splitting the root page. Open another psql session, and insert more data elsewhere in the gist index, to cause the root to split:

postgres=# INSERT INTO gisttest SELECT point(10,10) FROM generate_series(1,100000);
INSERT 0 100000

6. You can now let the first backend continue. It will hit the breakpoint in gistFindPath():

(gdb) cont
Continuing.

Breakpoint 2, gistFindPath (r=0x7f84f4bad328, child=6,
    newchildoffnum=0x20320d8) at gist.c:884
884             top = (GISTInsertStack *) palloc0(sizeof(GISTInsertStack));



BTW, the b-tree code deals with this scenario slightly differently. Splitting the root in b-tree splits the root page like any other page, and creates a new root page on a different block, while in GiST the root page is always at block number 0, and root split moves all the existing tuples on the root page to different blocks. Correspondingly, the code in b-tree to re-find the parent of a page also works slightly differently. The b-tree re-find algorithm just moves right until it finds the new parent. It will always find the parent, because it can only have moved right at the same level it used to be. However, in b-tree it's possible that the page that used to be the root page is not the root anymore. In that case the b-tree code does something similar to gistFindPath(), and starts scanning from the leftmost page at the right level. Search for "concurrent ROOT page split" in nbtinsert.c to find that code.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b756f6e..9a33597 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -59,6 +59,10 @@ static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
 				GISTSTATE *giststate, List *splitinfo);
 
+static void gistFindCorrectParent(Relation r, GISTInsertStack *child);
+static GISTInsertStack *gistFindPath(Relation r, BlockNumber child,
+									 OffsetNumber *newchildoffnum);
+
 
 #define ROTATEDIST(d) do { \
 	SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
@@ -626,6 +630,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 	firststack.blkno = GIST_ROOT_BLKNO;
 	firststack.lsn.xrecoff = 0;
 	firststack.parent = NULL;
+	firststack.childoffnum = InvalidOffsetNumber;
 	state.stack = stack = &firststack;
 
 	/*
@@ -702,9 +707,10 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			BlockNumber childblkno;
 			IndexTuple	newtup;
 			GISTInsertStack *item;
+			OffsetNumber childoffnum;
 
-			stack->childoffnum = gistchoose(state.r, stack->page, itup, giststate);
-			iid = PageGetItemId(stack->page, stack->childoffnum);
+			childoffnum = gistchoose(state.r, stack->page, itup, giststate);
+			iid = PageGetItemId(stack->page, childoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
 			childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
 
@@ -754,7 +760,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 				 * tuple.
 				 */
 				if (gistinserttuples(&state, stack, giststate, &newtup, 1,
-									 stack->childoffnum, InvalidBuffer))
+									 childoffnum, InvalidBuffer))
 				{
 					/*
 					 * If this was a root split, the root page continues to be
@@ -778,6 +784,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			item = (GISTInsertStack *) palloc0(sizeof(GISTInsertStack));
 			item->blkno = childblkno;
 			item->parent = stack;
+			item->childoffnum = childoffnum;
 			state.stack = stack = item;
 		}
 		else
@@ -852,14 +859,16 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 }
 
 /*
- * Traverse the tree to find path from root page to specified "child" block.
+ * Traverse the tree to find path from root page to specified block.
  *
- * returns from the beginning of closest parent;
+ * returns a new insertion stack, starting from the parent of 'child', up
+ * to the root. *newchildoffnum is set to the offset number of the downlink
+ * in the direct parent of child that points to the child.
  *
  * To prevent deadlocks, this should lock only one page at a time.
  */
 GISTInsertStack *
-gistFindPath(Relation r, BlockNumber child)
+gistFindPath(Relation r, BlockNumber child, OffsetNumber *newchildoffnum)
 {
 	Page		page;
 	Buffer		buffer;
@@ -867,16 +876,22 @@ gistFindPath(Relation r, BlockNumber child)
 				maxoff;
 	ItemId		iid;
 	IndexTuple	idxtuple;
+	List	   *fifo;
 	GISTInsertStack *top,
-			   *tail,
 			   *ptr;
 	BlockNumber blkno;
 
-	top = tail = (GISTInsertStack *) palloc0(sizeof(GISTInsertStack));
+	top = (GISTInsertStack *) palloc0(sizeof(GISTInsertStack));
 	top->blkno = GIST_ROOT_BLKNO;
+	top->childoffnum = InvalidOffsetNumber;
 
-	while (top && top->blkno != child)
+	fifo = list_make1(top);
+	while (fifo != NIL)
 	{
+		/* Get next page from the fifo */
+		top = linitial(fifo);
+		fifo = list_delete_first(fifo);
+
 		buffer = ReadBuffer(r, top->blkno);
 		LockBuffer(buffer, GIST_SHARE);
 		gistcheckpage(r, buffer);
@@ -884,9 +899,12 @@ gistFindPath(Relation r, BlockNumber child)
 
 		if (GistPageIsLeaf(page))
 		{
-			/* we can safety go away, follows only leaf pages */
+			/*
+			 * we can stop looking, there are only leaf pages left in the
+			 * stack.
+			 */
 			UnlockReleaseBuffer(buffer);
-			return NULL;
+			break;
 		}
 
 		top->lsn = PageGetLSN(page);
@@ -901,14 +919,18 @@ gistFindPath(Relation r, BlockNumber child)
 		if (top->parent && XLByteLT(top->parent->lsn, GistPageGetOpaque(page)->nsn) &&
 			GistPageGetOpaque(page)->rightlink != InvalidBlockNumber /* sanity check */ )
 		{
-			/* page splited while we thinking of... */
+			/*
+			 * Page was split while we were looking elsewhere. We didn't see
+			 * the downlink to page to the right of this one, when we
+			 * processed the parent level. Append it to the list of pages to
+			 * visit later.
+			 */
 			ptr = (GISTInsertStack *) palloc0(sizeof(GISTInsertStack));
 			ptr->blkno = GistPageGetOpaque(page)->rightlink;
 			ptr->childoffnum = InvalidOffsetNumber;
 			ptr->parent = top;
-			ptr->next = NULL;
-			tail->next = ptr;
-			tail = ptr;
+
+			fifo = lappend(fifo, ptr);
 		}
 
 		maxoff = PageGetMaxOffsetNumber(page);
@@ -920,50 +942,27 @@ gistFindPath(Relation r, BlockNumber child)
 			blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
 			if (blkno == child)
 			{
-				OffsetNumber poff = InvalidOffsetNumber;
-
-				/* make childs links */
-				ptr = top;
-				while (ptr->parent)
-				{
-					/* move childoffnum.. */
-					if (ptr == top)
-					{
-						/* first iteration */
-						poff = ptr->parent->childoffnum;
-						ptr->parent->childoffnum = ptr->childoffnum;
-					}
-					else
-					{
-						OffsetNumber tmp = ptr->parent->childoffnum;
-
-						ptr->parent->childoffnum = poff;
-						poff = tmp;
-					}
-					ptr = ptr->parent;
-				}
-				top->childoffnum = i;
+				/* Found it! */
 				UnlockReleaseBuffer(buffer);
-				return top;
+				*newchildoffnum = blkno;
+				return ptr;
 			}
 			else
 			{
-				/* Install next inner page to the end of stack */
+				/* Append this child to the list of page to visit later */
 				ptr = (GISTInsertStack *) palloc0(sizeof(GISTInsertStack));
 				ptr->blkno = blkno;
-				ptr->childoffnum = i;	/* set offsetnumber of child to child
-										 * !!! */
+				ptr->childoffnum = i;
 				ptr->parent = top;
-				ptr->next = NULL;
-				tail->next = ptr;
-				tail = ptr;
+
+				fifo = lappend(fifo, ptr);
 			}
 		}
 
 		UnlockReleaseBuffer(buffer);
-		top = top->next;
 	}
 
+	elog(ERROR, "failed to re-find parent of gist page %u", child);
 	return NULL;
 }
 
@@ -981,7 +980,8 @@ gistFindCorrectParent(Relation r, GISTInsertStack *child)
 	parent->page = (Page) BufferGetPage(parent->buffer);
 
 	/* here we don't need to distinguish between split and page update */
-	if (parent->childoffnum == InvalidOffsetNumber || !XLByteEQ(parent->lsn, PageGetLSN(parent->page)))
+	if (child->childoffnum == InvalidOffsetNumber ||
+		!XLByteEQ(parent->lsn, PageGetLSN(parent->page)))
 	{
 		/* parent is changed, look child in right links until found */
 		OffsetNumber i,
@@ -1000,7 +1000,7 @@ gistFindCorrectParent(Relation r, GISTInsertStack *child)
 				if (ItemPointerGetBlockNumber(&(idxtuple->t_tid)) == child->blkno)
 				{
 					/* yes!!, found */
-					parent->childoffnum = i;
+					child->childoffnum = i;
 					return;
 				}
 			}
@@ -1034,7 +1034,7 @@ gistFindCorrectParent(Relation r, GISTInsertStack *child)
 		}
 
 		/* ok, find new path */
-		ptr = parent = gistFindPath(r, child->blkno);
+		ptr = parent = gistFindPath(r, child->blkno, &child->childoffnum);
 		Assert(ptr != NULL);
 
 		/* read all buffers as expected by caller */
@@ -1104,7 +1104,7 @@ gistformdownlink(Relation rel, Buffer buf, GISTSTATE *giststate,
 
 		LockBuffer(stack->parent->buffer, GIST_EXCLUSIVE);
 		gistFindCorrectParent(rel, stack);
-		iid = PageGetItemId(stack->parent->page, stack->parent->childoffnum);
+		iid = PageGetItemId(stack->parent->page, stack->childoffnum);
 		downlink = (IndexTuple) PageGetItem(stack->parent->page, iid);
 		downlink = CopyIndexTuple(downlink);
 		LockBuffer(stack->parent->buffer, GIST_UNLOCK);
@@ -1132,7 +1132,7 @@ gistfixsplit(GISTInsertState *state, GISTSTATE *giststate)
 		 RelationGetRelationName(state->r), stack->blkno);
 
 	Assert(GistFollowRight(stack->page));
-	Assert(OffsetNumberIsValid(stack->parent->childoffnum));
+	Assert(OffsetNumberIsValid(stack->childoffnum));
 
 	buf = stack->buffer;
 
@@ -1269,7 +1269,7 @@ gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
 	tuples[1] = right->downlink;
 	gistinserttuples(state, stack->parent, giststate,
 					 tuples, 2,
-					 stack->parent->childoffnum,
+					 stack->childoffnum,
 					 left->buf);
 	LockBuffer(stack->parent->buffer, GIST_UNLOCK);
 	UnlockReleaseBuffer(right->buf);
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 77e3cb5..ca9fd25 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -223,9 +223,6 @@ typedef struct GISTInsertStack
 
 	/* pointer to parent */
 	struct GISTInsertStack *parent;
-
-	/* for gistFindPath */
-	struct GISTInsertStack *next;
 } GISTInsertStack;
 
 typedef struct GistSplitVector
@@ -293,8 +290,6 @@ extern void freeGISTstate(GISTSTATE *giststate);
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
-extern GISTInsertStack *gistFindPath(Relation r, BlockNumber child);
-
 /* gistxlog.c */
 extern void gist_redo(XLogRecPtr lsn, XLogRecord *record);
 extern void gist_desc(StringInfo buf, uint8 xl_info, char *rec);
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to