On 01/23/2014 11:36 PM, Peter Geoghegan wrote:
The first thing I noticed about this patchset is that it completely
expunges btree_xlog_startup(), btree_xlog_cleanup() and
btree_safe_restartpoint(). The post-recovery cleanup that previously
occurred to address both sets of problems (the problem addressed by
this patch, incomplete page splits, and the problem addressed by the
dependency patch, incomplete page deletions) now both occur
opportunistically/lazily. Notably, this means that there are now
exactly zero entries in the resource manager list with a
'restartpoint' callback. I guess we should consider removing it
entirely for that reason. As you said in the commit message where
gin_safe_restartpoint was similarly retired (commit 631118fe):

"""
The post-recovery cleanup mechanism was never totally reliable, as insertion
to the parent could fail e.g because of running out of memory or disk space,
leaving the tree in an inconsistent state.
"""

I'm of the general impression that post-recovery cleanup is
questionable. I'm surprised that you didn't mention this commit
previously. You just mentioned the original analogous work on GiST,
but this certainly seems to be something you've been thinking about
*systematically* eliminating for a while.

Yes, that's correct; these B-tree patches are part of a grand plan to eliminate post-recovery cleanup actions altogether.

So while post-recovery callbacks no longer exist for any
rmgr-managed resource, all of the remaining startup and cleanup
callbacks concern the simple management of memory for AM-specific
recovery contexts (for GiST, GIN and SP-GiST). I have to wonder if there isn't
a better abstraction than that, such as a generic recovery memory
context, allowing you to retire all 3 callbacks. I mean, StartupXLOG()
just calls those callbacks for each resource at exactly the same time
anyway, just as it initializes resource managers in precisely the same
manner earlier on. Plus if you look at what those AM-local memory
management routines do, it all seems very simple.

I think you should bump XLOG_PAGE_MAGIC.

Perhaps you should increase the elevel here:

+       elog(DEBUG1, "finishing incomplete split of %u/%u",
+                BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));

Since this is a very rare occurrence that involves some subtle
interactions, I can't think why you wouldn't want to LOG this for
forensic purposes.

Hmm. I'm not sure I agree with that line of thinking in general - what is an admin supposed to do about the message? But there's a precedent in vacuumlazy.c, which logs a message when it finds all-zero pages in the heap. So I guess that should be a LOG.

Why did you remove the local linkage of _bt_walk_left(), given that it
is called in exactly the same way as before? I guess that that is just
a vestige of some earlier revision.

Yeah, reverted.

I think I see some bugs in _bt_moveright(). If you examine
_bt_finish_split() in detail, you'll see that it doesn't just drop the
write buffer lock that the caller will have provided (per its
comments) - it also drops the buffer pin. It calls _bt_insert_parent()
last, which was previously only called last thing by _bt_insertonpg()
(some of the time), and _bt_insertonpg() is indeed documented to drop
both the lock and the pin. And if you look at _bt_moveright() again,
you'll see that there is a tacit assumption that the buffer pin isn't
dropped, or at least that "opaque" (the BTPageOpaque BT special page
area pointer) continues to point to the same page held in the same
buffer, even though the code doesn't set the "opaque" pointer again
and it may not point to a pinned buffer or even the appropriate
buffer. Ditto "page". So "opaque" may point to the wrong thing on
subsequent iterations - you don't do anything with the return value of
_bt_getbuf(), unlike all of the existing call sites. I believe you
should store its return value, and get the held page and the special
pointer on the page, and assign that to the "opaque" pointer before
the next iteration (an iteration in which you very probably really do
make progress not limited to completing a page split, the occurrence
of which the _bt_moveright() loop gets "stuck on"). You know, do what
is done in the non-split-handling case. It may not always be the same
buffer as before, obviously.

Yep, fixed.

Do you suppose that there are similar problems in other direct callers
of _bt_finish_split()? I'm thinking in particular of
_bt_findinsertloc().

Hmm, no, the other callers look OK to me.

I'm also not sure about the lock escalation that may occur within
_bt_moveright() for callers with BT_READ access - there is nothing to
prevent a caller from ending up being left with a write lock where
before they only had a read lock if they find that
!P_INCOMPLETE_SPLIT() with the write lock (after lock promotion) and
conclude that there is no split finishing to be done after all. It
looks like all callers currently won't care about that, but it seems
like that should either be prominently documented, or just not occur.
These interactions are complex enough that we ought to be very
explicit about where pins are required and dropped, or locks held
before or after calling.

I haven't looked into this in detail yet, but I admit I don't much like the _bt_moveright() function signature. It's strange to pass 'access' and 'forupdate' as separate arguments, and it's not totally clear what combinations make sense and what they mean. Some kind of refactoring is probably in order.

Here's a new version, rebased over the latest page-deletion patch, fix-btree-page-deletion-3.patch (http://www.postgresql.org/message-id/52e66e84.2050...@vmware.com). I haven't tested this extensively, but it passes "make check"...

- Heikki
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 03efc29..43ee75f 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -404,12 +404,34 @@ an additional insertion above that, etc).
 For a root split, the followon WAL entry is a "new root" entry rather than
 an "insertion" entry, but details are otherwise much the same.
 
-Because insertion involves multiple atomic actions, the WAL replay logic
-has to detect the case where a page split isn't followed by a matching
-insertion on the parent level, and then do that insertion on its own (and
-recursively for any subsequent parent insertion, of course).  This is
-feasible because the WAL entry for the split contains enough info to know
-what must be inserted in the parent level.
+Because splitting involves multiple atomic actions, it's possible that the
+system crashes between splitting a page and inserting the downlink for the
+new half to the parent. After recovery, the downlink for the new page will
+be missing. The search algorithm works correctly, as the page will be found
+by following the right-link from its left sibling, although if a lot of
+downlinks in the tree are missing, performance will suffer. A more serious
+consequence is that if the page without a downlink gets split again, the
+insertion algorithm will fail to find the location in the parent level to
+insert the downlink.
+
+Our approach is to create any missing downlinks on-the-fly, when descending
+the tree to find the target of a new insertion. It could be done during plain
+searches, too, but it seems best not to put any extra updates in what would
+otherwise be a read-only operation (updating is not possible in hot standby
+mode anyway). To identify missing downlinks, when a page is split, the left page
+is flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
+When the downlink is inserted to the parent, the flag is cleared atomically
+with the insertion. The child page is kept locked until the insertion in the
+parent is finished and the flag in the child cleared, but can be released
+immediately after that, before recursing up the tree, if the parent also
+needs to be split. This ensures that an incompletely split page is not seen
+under normal circumstances; one is left behind only if the insertion into
+the parent fails for some reason.
+
+We flag the left page, even though it's the right page that's missing the
+downlink, because it's more convenient to know, already when following the
+right-link from the left page to the right page, that the right page will
+need to have its downlink inserted into the parent.
 
 When splitting a non-root page that is alone on its level, the required
 metapage update (of the "fast root" link) is performed and logged as part
@@ -422,6 +444,14 @@ page is a second record.  If vacuum is interrupted for some reason, or the
 system crashes, the tree is consistent for searches and insertions.  The next
 VACUUM will find the half-dead leaf page and continue the deletion.
 
+Before 9.4, we used to keep track of incomplete splits and page deletions
+during recovery and finish them immediately at end of recovery, instead of
+doing it lazily at the next insertion or vacuum. However, that made
+recovery much more complicated, and only fixed the problem when crash
+recovery was performed. An incomplete split can also occur if an otherwise
+recoverable error, like out-of-memory or out-of-disk-space, happens while
+inserting the downlink to the parent.
+
 Scans during Recovery
 ---------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 49c084a..4d60ae0 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -58,15 +58,18 @@ static void _bt_findinsertloc(Relation rel,
 				  int keysz,
 				  ScanKey scankey,
 				  IndexTuple newtup,
+				  BTStack stack,
 				  Relation heapRel);
-static void _bt_insertonpg(Relation rel, Buffer buf,
+static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
 			   bool split_only_page);
-static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
-		  OffsetNumber newitemoff, Size newitemsz,
+static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
+		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
 		  IndexTuple newitem, bool newitemonleft);
+static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
+				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 				 OffsetNumber newitemoff,
 				 Size newitemsz,
@@ -130,7 +133,8 @@ top:
 	 * move right in the tree.	See Lehman and Yao for an excruciatingly
 	 * precise description.
 	 */
-	buf = _bt_moveright(rel, buf, natts, itup_scankey, false, BT_WRITE);
+	buf = _bt_moveright(rel, buf, natts, itup_scankey, false,
+						true, stack, BT_WRITE);
 
 	/*
 	 * If we're not allowing duplicates, make sure the key isn't already in
@@ -183,8 +187,9 @@ top:
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
 		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup, heapRel);
-		_bt_insertonpg(rel, buf, stack, itup, offset, false);
+		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
+						  stack, heapRel);
+		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
 	}
 	else
 	{
@@ -508,6 +513,7 @@ _bt_findinsertloc(Relation rel,
 				  int keysz,
 				  ScanKey scankey,
 				  IndexTuple newtup,
+				  BTStack stack,
 				  Relation heapRel)
 {
 	Buffer		buf = *bufptr;
@@ -569,6 +575,7 @@ _bt_findinsertloc(Relation rel,
 	while (PageGetFreeSpace(page) < itemsz)
 	{
 		Buffer		rbuf;
+		BlockNumber	rblkno;
 
 		/*
 		 * before considering moving right, see if we can obtain enough space
@@ -606,18 +613,33 @@ _bt_findinsertloc(Relation rel,
 		 */
 		rbuf = InvalidBuffer;
 
+		rblkno = lpageop->btpo_next;
 		for (;;)
 		{
-			BlockNumber rblkno = lpageop->btpo_next;
-
 			rbuf = _bt_relandgetbuf(rel, rbuf, rblkno, BT_WRITE);
 			page = BufferGetPage(rbuf);
 			lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+			/*
+			 * If this page was incompletely split, finish that now. We do
+			 * this while holding a lock on the left sibling, which is not
+			 * good because finishing the split could be a fairly lengthy
+			 * operation.  But this should happen very seldom.
+			 */
+			if (P_INCOMPLETE_SPLIT(lpageop))
+			{
+				_bt_finish_split(rel, rbuf, stack);
+				rbuf = InvalidBuffer;
+				continue;
+			}
+
 			if (!P_IGNORE(lpageop))
 				break;
 			if (P_RIGHTMOST(lpageop))
 				elog(ERROR, "fell off the end of index \"%s\"",
 					 RelationGetRelationName(rel));
+
+			rblkno = lpageop->btpo_next;
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
@@ -663,6 +685,10 @@ _bt_findinsertloc(Relation rel,
  *		insertion, and the buffer must be pinned and write-locked.	On return,
  *		we will have dropped both the pin and the lock on the buffer.
  *
+ *		When inserting to a non-leaf page, 'cbuf' is the left-sibling of the
+ *		page we're inserting the downlink for.  This function will clear the
+ *		INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *
  *		The locking interactions in this code are critical.  You should
  *		grok Lehman and Yao's paper before making any changes.  In addition,
  *		you need to understand how we disambiguate duplicate keys in this
@@ -676,6 +702,7 @@ _bt_findinsertloc(Relation rel,
 static void
 _bt_insertonpg(Relation rel,
 			   Buffer buf,
+			   Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
@@ -689,6 +716,14 @@ _bt_insertonpg(Relation rel,
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
+	/*
+	 * The caller should've finished any incomplete splits and page deletions
+	 * already
+	 */
+	if (P_INCOMPLETE_SPLIT(lpageop))
+		elog(ERROR, "cannot insert to incompletely split page %u",
+			 BufferGetBlockNumber(buf));
+
 	itemsz = IndexTupleDSize(*itup);
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
@@ -713,7 +748,7 @@ _bt_insertonpg(Relation rel,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, buf, firstright,
+		rbuf = _bt_split(rel, buf, cbuf, firstright,
 						 newitemoff, itemsz, itup, newitemonleft);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
@@ -787,11 +822,21 @@ _bt_insertonpg(Relation rel,
 			MarkBufferDirty(metabuf);
 		}
 
+		/* clear INCOMPLETE_SPLIT flag on child if this finishes a split */
+		if (!P_ISLEAF(lpageop))
+		{
+			Page		cpage = BufferGetPage(cbuf);
+			BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+			Assert(P_INCOMPLETE_SPLIT(cpageop));
+			cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+			MarkBufferDirty(cbuf);
+		}
+
 		/* XLOG stuff */
 		if (RelationNeedsWAL(rel))
 		{
 			xl_btree_insert xlrec;
-			BlockNumber xldownlink;
+			BlockNumber xlleftchild;
 			xl_btree_metadata xlmeta;
 			uint8		xlinfo;
 			XLogRecPtr	recptr;
@@ -811,12 +856,15 @@ _bt_insertonpg(Relation rel,
 				xlinfo = XLOG_BTREE_INSERT_LEAF;
 			else
 			{
-				xldownlink = ItemPointerGetBlockNumber(&(itup->t_tid));
-				Assert(ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
-
-				nextrdata->data = (char *) &xldownlink;
+				/*
+				 * Include the block number of the left child, whose
+				 * INCOMPLETE_SPLIT flag is cleared.
+				 */
+				xlleftchild = BufferGetBlockNumber(cbuf);
+				nextrdata->data = (char *) &xlleftchild;
 				nextrdata->len = sizeof(BlockNumber);
-				nextrdata->buffer = InvalidBuffer;
+				nextrdata->buffer = cbuf;
+				nextrdata->buffer_std = true;
 				nextrdata->next = nextrdata + 1;
 				nextrdata++;
 
@@ -869,6 +917,8 @@ _bt_insertonpg(Relation rel,
 		END_CRIT_SECTION();
 
 		/* release buffers; send out relcache inval if metapage changed */
+		if (!P_ISLEAF(lpageop))
+			_bt_relbuf(rel, cbuf);
 		if (BufferIsValid(metabuf))
 		{
 			if (!InRecovery)
@@ -888,11 +938,15 @@ _bt_insertonpg(Relation rel,
  *		new right page.  newitemoff etc. tell us about the new item that
  *		must be inserted along with the data from the old page.
  *
+ *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
+ *		page we're inserting the downlink for. This function will clear the
+ *		INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
-_bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
+_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
 		  bool newitemonleft)
 {
@@ -960,6 +1014,8 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 	lopaque->btpo_flags = oopaque->btpo_flags;
 	lopaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
 	ropaque->btpo_flags = lopaque->btpo_flags;
+	/* set flag in left page indicating that the right page has no downlink */
+	lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
 	lopaque->btpo_prev = oopaque->btpo_prev;
 	lopaque->btpo_next = rightpagenumber;
 	ropaque->btpo_prev = origpagenumber;
@@ -1183,6 +1239,18 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 		MarkBufferDirty(sbuf);
 	}
 
+	/*
+	 * Clear INCOMPLETE_SPLIT flag on child if inserting the new item finishes
+	 * a split.
+	 */
+	if (ropaque->btpo.level > 0)
+	{
+		Page		cpage = BufferGetPage(cbuf);
+		BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+		cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+		MarkBufferDirty(cbuf);
+	}
+
 	/* XLOG stuff */
 	if (RelationNeedsWAL(rel))
 	{
@@ -1205,34 +1273,6 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 
 		lastrdata = &rdata[0];
 
-		if (ropaque->btpo.level > 0)
-		{
-			/* Log downlink on non-leaf pages */
-			lastrdata->next = lastrdata + 1;
-			lastrdata++;
-
-			lastrdata->data = (char *) &newitem->t_tid.ip_blkid;
-			lastrdata->len = sizeof(BlockIdData);
-			lastrdata->buffer = InvalidBuffer;
-
-			/*
-			 * We must also log the left page's high key, because the right
-			 * page's leftmost key is suppressed on non-leaf levels.  Show it
-			 * as belonging to the left page buffer, so that it is not stored
-			 * if XLogInsert decides it needs a full-page image of the left
-			 * page.
-			 */
-			lastrdata->next = lastrdata + 1;
-			lastrdata++;
-
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			lastrdata->data = (char *) item;
-			lastrdata->len = MAXALIGN(IndexTupleSize(item));
-			lastrdata->buffer = buf;	/* backup block 1 */
-			lastrdata->buffer_std = true;
-		}
-
 		/*
 		 * Log the new item and its offset, if it was inserted on the left
 		 * page. (If it was put on the right page, we don't need to explicitly
@@ -1259,17 +1299,40 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 			lastrdata->buffer = buf;	/* backup block 1 */
 			lastrdata->buffer_std = true;
 		}
-		else if (ropaque->btpo.level == 0)
+
+		/* Log left page */
+		if (ropaque->btpo.level > 0)
 		{
+			lastrdata->next = lastrdata + 1;
+			lastrdata++;
+
 			/*
-			 * Although we don't need to WAL-log the new item, we still need
-			 * XLogInsert to consider storing a full-page image of the left
-			 * page, so make an empty entry referencing that buffer. This also
-			 * ensures that the left page is always backup block 1.
+			 * We must also log the left page's high key, because the right
+			 * page's leftmost key is suppressed on non-leaf levels.  Show it
+			 * as belonging to the left page buffer, so that it is not stored
+			 * if XLogInsert decides it needs a full-page image of the left
+			 * page.
 			 */
+			itemid = PageGetItemId(origpage, P_HIKEY);
+			item = (IndexTuple) PageGetItem(origpage, itemid);
+			lastrdata->data = (char *) item;
+			lastrdata->len = MAXALIGN(IndexTupleSize(item));
+			lastrdata->buffer = buf;	/* backup block 1 */
+			lastrdata->buffer_std = true;
+		}
+
+		if (ropaque->btpo.level == 0 && !newitemonleft)
+		{
 			lastrdata->next = lastrdata + 1;
 			lastrdata++;
 
+			/*
+			 * Although we don't need to WAL-log anything for the left page
+			 * in this case, we still need XLogInsert to consider storing a
+			 * full-page image of the left page, so make an empty entry
+			 * referencing that buffer. This also ensures that the left page
+			 * is always backup block 1.
+			 */
 			lastrdata->data = NULL;
 			lastrdata->len = 0;
 			lastrdata->buffer = buf;	/* backup block 1 */
@@ -1277,6 +1340,22 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 		}
 
 		/*
+		 * Log block number of left child, whose INCOMPLETE_SPLIT flag this
+		 * insertion clears.
+		 */
+		if (ropaque->btpo.level > 0)
+		{
+			BlockNumber cblkno = BufferGetBlockNumber(cbuf);
+			lastrdata->next = lastrdata + 1;
+			lastrdata++;
+
+			lastrdata->data = (char *) &cblkno;
+			lastrdata->len = sizeof(BlockNumber);
+			lastrdata->buffer = cbuf;	/* backup block 2 */
+			lastrdata->buffer_std = true;
+		}
+
+		/*
 		 * Log the contents of the right page in the format understood by
 		 * _bt_restore_page(). We set lastrdata->buffer to InvalidBuffer,
 		 * because we're going to recreate the whole page anyway, so it should
@@ -1305,7 +1384,7 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 
 			lastrdata->data = NULL;
 			lastrdata->len = 0;
-			lastrdata->buffer = sbuf;	/* backup block 2 */
+			lastrdata->buffer = sbuf;	/* bkp block 2 (leaf) or 3 (non-leaf) */
 			lastrdata->buffer_std = true;
 		}
 
@@ -1332,6 +1411,10 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 	if (!P_RIGHTMOST(ropaque))
 		_bt_relbuf(rel, sbuf);
 
+	/* release the child */
+	if (ropaque->btpo.level > 0)
+		_bt_relbuf(rel, cbuf);
+
 	/* split's done */
 	return rbuf;
 }
@@ -1602,10 +1685,8 @@ _bt_checksplitloc(FindSplitData *state,
  *			have to be efficient (concurrent ROOT split, WAL recovery)
  * is_root - we split the true root
  * is_only - we split a page alone on its level (might have been fast root)
- *
- * This is exported so it can be called by nbtxlog.c.
  */
-void
+static void
 _bt_insert_parent(Relation rel,
 				  Buffer buf,
 				  Buffer rbuf,
@@ -1624,8 +1705,7 @@ _bt_insert_parent(Relation rel,
 	 *
 	 * If we have to search for the parent level, we do so by re-descending
 	 * from the root.  This is not super-efficient, but it's rare enough not
-	 * to matter.  (This path is also taken when called from WAL recovery ---
-	 * we have no stack in that case.)
+	 * to matter.
 	 */
 	if (is_root)
 	{
@@ -1684,12 +1764,13 @@ _bt_insert_parent(Relation rel,
 		 * 05/27/97
 		 */
 		ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
-
 		pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
 
-		/* Now we can unlock the children */
+		/*
+		 * Now we can unlock the right child. The left child will be unlocked
+		 * by _bt_insertonpg().
+		 */
 		_bt_relbuf(rel, rbuf);
-		_bt_relbuf(rel, buf);
 
 		/* Check for error only after writing children */
 		if (pbuf == InvalidBuffer)
@@ -1697,7 +1778,7 @@ _bt_insert_parent(Relation rel,
 				 RelationGetRelationName(rel), bknum, rbknum);
 
 		/* Recursively update the parent */
-		_bt_insertonpg(rel, pbuf, stack->bts_parent,
+		_bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
 					   is_only);
 
@@ -1707,6 +1788,62 @@ _bt_insert_parent(Relation rel,
 }
 
 /*
+ * _bt_finish_split() -- Finish an incomplete split
+ *
+ * A crash or other failure can leave a split incomplete. The insertion
+ * routines refuse to insert on a page that is incompletely split.
+ * Before inserting on such a page, call _bt_finish_split().
+ *
+ * On entry, 'lbuf' (the left half of the incomplete split) must be pinned
+ * and write-locked.  The lock and pin are released on exit.
+ */
+void
+_bt_finish_split(Relation rel, Buffer lbuf, BTStack stack)
+{
+	Page		lpage = BufferGetPage(lbuf);
+	BTPageOpaque lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
+	Buffer		rbuf;
+	Page		rpage;
+	BTPageOpaque rpageop;
+	bool		was_root;
+	bool		was_only;
+
+	Assert(P_INCOMPLETE_SPLIT(lpageop));
+
+	/* Lock right sibling, the one missing the downlink */
+	rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
+	rpage = BufferGetPage(rbuf);
+	rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
+
+	/* Could this be a root split? */
+	if (!stack)
+	{
+		Buffer		metabuf;
+		Page		metapg;
+		BTMetaPageData *metad;
+
+		/* acquire lock on the metapage */
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+		metapg = BufferGetPage(metabuf);
+		metad = BTPageGetMeta(metapg);
+
+		was_root = (metad->btm_root == BufferGetBlockNumber(lbuf));
+
+		_bt_relbuf(rel, metabuf);
+	}
+	else
+		was_root = false;
+
+	/* Was this the only page on the level before split? */
+	was_only = (P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop));
+
+	elog(DEBUG1, "finishing incomplete split of %u/%u",
+		 BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));
+
+	_bt_insert_parent(rel, lbuf, rbuf, stack, was_root, was_only);
+}
+
+/*
  *	_bt_getstackbuf() -- Walk back up the tree one step, and find the item
  *						 we last looked at in the parent.
  *
@@ -1738,6 +1875,12 @@ _bt_getstackbuf(Relation rel, BTStack stack, int access)
 		page = BufferGetPage(buf);
 		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
+		if (access == BT_WRITE && P_INCOMPLETE_SPLIT(opaque))
+		{
+			_bt_finish_split(rel, buf, stack->bts_parent);
+			continue;
+		}
+
 		if (!P_IGNORE(opaque))
 		{
 			OffsetNumber offnum,
@@ -1842,6 +1985,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 				rbkno;
 	BlockNumber rootblknum;
 	BTPageOpaque rootopaque;
+	BTPageOpaque lopaque;
 	ItemId		itemid;
 	IndexTuple	item;
 	Size		itemsz;
@@ -1853,6 +1997,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	lbkno = BufferGetBlockNumber(lbuf);
 	rbkno = BufferGetBlockNumber(rbuf);
 	lpage = BufferGetPage(lbuf);
+	lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 
 	/* get a new root page */
 	rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1926,6 +2071,11 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 			 BufferGetBlockNumber(lbuf), RelationGetRelationName(rel));
 	pfree(new_item);
 
+	/* Clear the incomplete-split flag in the left child */
+	Assert(P_INCOMPLETE_SPLIT(lopaque));
+	lopaque->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+	MarkBufferDirty(lbuf);
+
 	MarkBufferDirty(rootbuf);
 	MarkBufferDirty(metabuf);
 
@@ -1934,7 +2084,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	{
 		xl_btree_newroot xlrec;
 		XLogRecPtr	recptr;
-		XLogRecData rdata[2];
+		XLogRecData rdata[3];
 
 		xlrec.node = rel->rd_node;
 		xlrec.rootblk = rootblknum;
@@ -1953,7 +2103,13 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		rdata[1].len = ((PageHeader) rootpage)->pd_special -
 			((PageHeader) rootpage)->pd_upper;
 		rdata[1].buffer = InvalidBuffer;
-		rdata[1].next = NULL;
+		rdata[1].next = &(rdata[2]);
+
+		/* Make a full-page image of the left child if needed */
+		rdata[2].data = NULL;
+		rdata[2].len = 0;
+		rdata[2].buffer = lbuf;
+		rdata[2].next = NULL;
 
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_NEWROOT, rdata);
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 7ccbfd3..6bf1341 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1026,8 +1026,13 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
 			 * It's only child, so safe if parent would itself be removable.
 			 * We have to check the parent itself, and then recurse to test
 			 * the conditions at the parent's parent.
+			 *
+			 * Also check that the page, or its left sibling, isn't
+			 * incompletely split.  These are the same checks that we already
+			 * did for the leaf page, see comments in _bt_mark_page_halfdead().
 			 */
-			if (P_RIGHTMOST(opaque) || P_ISROOT(opaque))
+			if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) ||
+				P_INCOMPLETE_SPLIT(opaque))
 			{
 				_bt_relbuf(rel, pbuf);
 				return false;
@@ -1128,7 +1133,8 @@ _bt_pagedel(Relation rel, Buffer buf)
 		 * check that page is not already deleted and is empty.
 		 */
 		if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
-			P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
+			P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
+			P_INCOMPLETE_SPLIT(opaque))
 		{
 			/* Should never fail to delete a half-dead page */
 			Assert(!P_ISHALFDEAD(opaque));
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index bc62fbc..d4220d9 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -51,7 +51,9 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
  * NOTE that the returned buffer is read-locked regardless of the access
  * parameter.  However, access = BT_WRITE will allow an empty root page
  * to be created and returned.	When access = BT_READ, an empty index
- * will result in *bufP being set to InvalidBuffer.
+ * will result in *bufP being set to InvalidBuffer.  Also, in BT_WRITE mode,
+ * any incomplete splits or half-dead pages encountered during the search
+ * will be finished.
  */
 BTStack
 _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
@@ -82,8 +84,13 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Race -- the page we just grabbed may have split since we read its
 		 * pointer in the parent (or metapage).  If it has, we may need to
 		 * move right to its new sibling.  Do that.
+		 *
+		 * In write-mode, allow _bt_moveright to finish any incomplete splits
+		 * along the way.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey, BT_READ);
+		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+							  (access == BT_WRITE), stack_in,
+							  BT_READ);
 
 		/* if this is a leaf page, we're done */
 		page = BufferGetPage(*bufP);
@@ -148,6 +155,11 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * item >= scankey.  When nextkey is true, we are looking for the first
  * item strictly greater than scankey.
  *
+ * If forupdate is true, we will attempt to finish any incomplete splits
+ * that we encounter. This is required when searching for a target page for
+ * an insertion, because we don't allow inserting on a page with incomplete
+ * actions. 'stack' is only used if forupdate is true.
+ *
  * On entry, we have the buffer pinned and a lock of the type specified by
  * 'access'.  If we move right, we release the buffer and lock and acquire
  * the same on the right sibling.  Return value is the buffer we stop at.
@@ -158,15 +170,14 @@ _bt_moveright(Relation rel,
 			  int keysz,
 			  ScanKey scankey,
 			  bool nextkey,
+			  bool forupdate,
+			  BTStack stack,
 			  int access)
 {
 	Page		page;
 	BTPageOpaque opaque;
 	int32		cmpval;
 
-	page = BufferGetPage(buf);
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
 	/*
 	 * When nextkey = false (normal case): if the scan key that brought us to
 	 * this page is > the high key stored on the page, then the page has split
@@ -184,16 +195,48 @@ _bt_moveright(Relation rel,
 	 */
 	cmpval = nextkey ? 0 : 1;
 
-	while (!P_RIGHTMOST(opaque) &&
-		   (P_IGNORE(opaque) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval))
+	for (;;)
 	{
-		/* step right one page */
-		BlockNumber rblkno = opaque->btpo_next;
-
-		buf = _bt_relandgetbuf(rel, buf, rblkno, access);
 		page = BufferGetPage(buf);
 		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+		if (P_RIGHTMOST(opaque))
+			break;
+
+		/*
+		 * Finish any incomplete splits we encounter along the way.
+		 */
+		if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+		{
+			BlockNumber blkno = BufferGetBlockNumber(buf);
+
+			if (access == BT_READ)
+			{
+				/* upgrade our lock, and re-check the flag under it */
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+				LockBuffer(buf, BT_WRITE);
+			}
+
+			if (P_INCOMPLETE_SPLIT(opaque))
+				_bt_finish_split(rel, buf, stack);
+			else
+				_bt_relbuf(rel, buf);
+
+			/*
+			 * The buffer was released above; re-acquire it with the lock
+			 * mode the caller asked for, and re-check from the top.
+			 */
+			buf = _bt_getbuf(rel, blkno, access);
+			continue;
+		}
+
+		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		{
+			/* step right one page */
+			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+			continue;
+		}
+		else
+			break;
 	}
 
 	if (P_IGNORE(opaque))
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index e7317c4..8409bad 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,61 +21,6 @@
 #include "miscadmin.h"
 
 /*
- * We must keep track of expected insertions due to page splits, and apply
- * them manually if they are not seen in the WAL log during replay.  This
- * makes it safe for page insertion to be a multiple-WAL-action process.
- *
- * The data structure is a simple linked list --- this should be good enough,
- * since we don't expect a page split to remain incomplete for long.  In any
- * case we need to respect the order of operations.
- */
-typedef struct bt_incomplete_split
-{
-	RelFileNode node;			/* the index */
-	bool		is_root;		/* we split the root */
-	BlockNumber leftblk;		/* left half of split */
-	BlockNumber rightblk;		/* right half of split */
-} bt_incomplete_split;
-
-static List *incomplete_splits;
-
-
-static void
-log_incomplete_split(RelFileNode node, BlockNumber leftblk,
-					 BlockNumber rightblk, bool is_root)
-{
-	bt_incomplete_split *action = palloc(sizeof(bt_incomplete_split));
-
-	action->node = node;
-	action->is_root = is_root;
-	action->leftblk = leftblk;
-	action->rightblk = rightblk;
-	incomplete_splits = lappend(incomplete_splits, action);
-}
-
-static void
-forget_matching_split(RelFileNode node, BlockNumber downlink, bool is_root)
-{
-	ListCell   *l;
-
-	foreach(l, incomplete_splits)
-	{
-		bt_incomplete_split *action = (bt_incomplete_split *) lfirst(l);
-
-		if (RelFileNodeEquals(node, action->node) &&
-			downlink == action->rightblk)
-		{
-			if (is_root != action->is_root)
-				elog(LOG, "forget_matching_split: fishy is_root data (expected %d, got %d)",
-					 action->is_root, is_root);
-			incomplete_splits = list_delete_ptr(incomplete_splits, action);
-			pfree(action);
-			break;				/* need not look further */
-		}
-	}
-}
-
-/*
  * _bt_restore_page -- re-enter all the index tuples on a page
  *
  * The page is freshly init'd, and *from (length len) is a copy of what
@@ -149,23 +94,60 @@ _bt_restore_meta(RelFileNode rnode, XLogRecPtr lsn,
 	UnlockReleaseBuffer(metabuf);
 }
 
+/*
+ * _bt_clear_incomplete_split -- clear INCOMPLETE_SPLIT flag on a page
+ *
+ * This is a common subroutine of the redo functions of all the WAL record
+ * types that can insert a downlink: insert, split, and newroot.
+ */
+static void
+_bt_clear_incomplete_split(XLogRecPtr lsn, XLogRecord *record,
+						   RelFileNode rnode, BlockNumber cblock)
+{
+	Buffer buf;
+
+	buf = XLogReadBuffer(rnode, cblock, false);
+	if (BufferIsValid(buf))
+	{
+		Page		page = (Page) BufferGetPage(buf);
+
+		if (lsn > PageGetLSN(page))
+		{
+			BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+			Assert((pageop->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0);
+			pageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+
+			PageSetLSN(page, lsn);
+			MarkBufferDirty(buf);
+		}
+		UnlockReleaseBuffer(buf);
+	}
+}
+
 static void
 btree_xlog_insert(bool isleaf, bool ismeta,
 				  XLogRecPtr lsn, XLogRecord *record)
 {
 	xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
 	Buffer		buffer;
+	Buffer		cbuffer = InvalidBuffer;
 	Page		page;
 	char	   *datapos;
 	int			datalen;
 	xl_btree_metadata md;
-	BlockNumber downlink = 0;
+	BlockNumber cblkno = 0;
+	int			main_blk_index;
 
 	datapos = (char *) xlrec + SizeOfBtreeInsert;
 	datalen = record->xl_len - SizeOfBtreeInsert;
-	if (!isleaf)
+	/*
+	 * If this insertion finishes a split at a lower level, extract the
+	 * block number of the (left) child whose incomplete-split flag it
+	 * clears.
+	 */
+	if (!isleaf && (record->xl_info & XLR_BKP_BLOCK(0)) == 0)
 	{
-		memcpy(&downlink, datapos, sizeof(BlockNumber));
+		memcpy(&cblkno, datapos, sizeof(BlockNumber));
+		Assert(cblkno != 0);
 		datapos += sizeof(BlockNumber);
 		datalen -= sizeof(BlockNumber);
 	}
@@ -176,8 +158,19 @@ btree_xlog_insert(bool isleaf, bool ismeta,
 		datalen -= sizeof(xl_btree_metadata);
 	}
 
-	if (record->xl_info & XLR_BKP_BLOCK(0))
-		(void) RestoreBackupBlock(lsn, record, 0, false, false);
+	if (!isleaf)
+	{
+		if (record->xl_info & XLR_BKP_BLOCK(0))
+			(void) RestoreBackupBlock(lsn, record, 0, false, false);
+		else
+			_bt_clear_incomplete_split(lsn, record, xlrec->target.node, cblkno);
+		main_blk_index = 1;
+	}
+	else
+		main_blk_index = 0;
+
+	if (record->xl_info & XLR_BKP_BLOCK(main_blk_index))
+		(void) RestoreBackupBlock(lsn, record, main_blk_index, false, false);
 	else
 	{
 		buffer = XLogReadBuffer(xlrec->target.node,
@@ -187,11 +180,7 @@ btree_xlog_insert(bool isleaf, bool ismeta,
 		{
 			page = (Page) BufferGetPage(buffer);
 
-			if (lsn <= PageGetLSN(page))
-			{
-				UnlockReleaseBuffer(buffer);
-			}
-			else
+			if (lsn > PageGetLSN(page))
 			{
 				if (PageAddItem(page, (Item) datapos, datalen,
 							ItemPointerGetOffsetNumber(&(xlrec->target.tid)),
@@ -200,11 +189,14 @@ btree_xlog_insert(bool isleaf, bool ismeta,
 
 				PageSetLSN(page, lsn);
 				MarkBufferDirty(buffer);
-				UnlockReleaseBuffer(buffer);
 			}
+			UnlockReleaseBuffer(buffer);
 		}
 	}
 
 	/*
 	 * Note: in normal operation, we'd update the metapage while still holding
 	 * lock on the page we inserted into.  But during replay it's not
@@ -216,10 +208,6 @@ btree_xlog_insert(bool isleaf, bool ismeta,
 		_bt_restore_meta(xlrec->target.node, lsn,
 						 md.root, md.level,
 						 md.fastroot, md.fastlevel);
-
-	/* Forget any split this insertion completes */
-	if (!isleaf)
-		forget_matching_split(xlrec->target.node, downlink, false);
 }
 
 static void
@@ -227,6 +215,8 @@ btree_xlog_split(bool onleft, bool isroot,
 				 XLogRecPtr lsn, XLogRecord *record)
 {
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
+	bool		isleaf = (xlrec->level == 0);
+	Buffer		lbuf;
 	Buffer		rbuf;
 	Page		rpage;
 	BTPageOpaque ropaque;
@@ -237,42 +227,18 @@ btree_xlog_split(bool onleft, bool isroot,
 	Size		newitemsz = 0;
 	Item		left_hikey = NULL;
 	Size		left_hikeysz = 0;
+	BlockNumber cblkno = InvalidBlockNumber;
 
 	datapos = (char *) xlrec + SizeOfBtreeSplit;
 	datalen = record->xl_len - SizeOfBtreeSplit;
 
-	/* Forget any split this insertion completes */
-	if (xlrec->level > 0)
-	{
-		/* we assume SizeOfBtreeSplit is at least 16-bit aligned */
-		BlockNumber downlink = BlockIdGetBlockNumber((BlockId) datapos);
-
-		datapos += sizeof(BlockIdData);
-		datalen -= sizeof(BlockIdData);
-
-		forget_matching_split(xlrec->node, downlink, false);
-
-		/* Extract left hikey and its size (still assuming 16-bit alignment) */
-		if (!(record->xl_info & XLR_BKP_BLOCK(0)))
-		{
-			/* We assume 16-bit alignment is enough for IndexTupleSize */
-			left_hikey = (Item) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
-	}
-
-	/* Extract newitem and newitemoff, if present */
+	/* Extract newitemoff and newitem, if present */
 	if (onleft)
 	{
-		/* Extract the offset (still assuming 16-bit alignment) */
 		memcpy(&newitemoff, datapos, sizeof(OffsetNumber));
 		datapos += sizeof(OffsetNumber);
 		datalen -= sizeof(OffsetNumber);
 	}
-
 	if (onleft && !(record->xl_info & XLR_BKP_BLOCK(0)))
 	{
 		/*
@@ -286,6 +252,37 @@ btree_xlog_split(bool onleft, bool isroot,
 		datalen -= newitemsz;
 	}
 
+	/* Extract left hikey and its size (still assuming 16-bit alignment) */
+	if (!isleaf && !(record->xl_info & XLR_BKP_BLOCK(0)))
+	{
+		left_hikey = (Item) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
+	}
+	/*
+	 * If this insertion finishes an incomplete split, get the block number
+	 * of the child.
+	 */
+	if (!isleaf && !(record->xl_info & XLR_BKP_BLOCK(1)))
+	{
+		memcpy(&cblkno, datapos, sizeof(BlockNumber));
+		datapos += sizeof(BlockNumber);
+		datalen -= sizeof(BlockNumber);
+	}
+
+	/*
+	 * Clear the incomplete split flag on the left sibling of the child page
+	 * this is a downlink for.
+	 */
+	if (!isleaf)
+	{
+		if (record->xl_info & XLR_BKP_BLOCK(1))
+			(void) RestoreBackupBlock(lsn, record, 1, false, false);
+		else
+			_bt_clear_incomplete_split(lsn, record, xlrec->node, cblkno);
+	}
+
 	/* Reconstruct right (new) sibling page from scratch */
 	rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
 	Assert(BufferIsValid(rbuf));
@@ -297,7 +294,7 @@ btree_xlog_split(bool onleft, bool isroot,
 	ropaque->btpo_prev = xlrec->leftsib;
 	ropaque->btpo_next = xlrec->rnext;
 	ropaque->btpo.level = xlrec->level;
-	ropaque->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
+	ropaque->btpo_flags = isleaf ? BTP_LEAF : 0;
 	ropaque->btpo_cycleid = 0;
 
 	_bt_restore_page(rpage, datapos, datalen);
@@ -306,7 +303,7 @@ btree_xlog_split(bool onleft, bool isroot,
 	 * On leaf level, the high key of the left page is equal to the first key
 	 * on the right page.
 	 */
-	if (xlrec->level == 0)
+	if (isleaf)
 	{
 		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
 
@@ -321,10 +318,10 @@ btree_xlog_split(bool onleft, bool isroot,
 
 	/* Now reconstruct left (original) sibling page */
 	if (record->xl_info & XLR_BKP_BLOCK(0))
-		(void) RestoreBackupBlock(lsn, record, 0, false, false);
+		lbuf = RestoreBackupBlock(lsn, record, 0, false, true);
 	else
 	{
-		Buffer		lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
+		lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
 
 		if (BufferIsValid(lbuf))
 		{
@@ -381,19 +378,21 @@ btree_xlog_split(bool onleft, bool isroot,
 					elog(PANIC, "failed to add high key to left page after split");
 
 				/* Fix opaque fields */
-				lopaque->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
+				lopaque->btpo_flags = BTP_INCOMPLETE_SPLIT;
+				if (isleaf)
+					lopaque->btpo_flags |= BTP_LEAF;
 				lopaque->btpo_next = xlrec->rightsib;
 				lopaque->btpo_cycleid = 0;
 
 				PageSetLSN(lpage, lsn);
 				MarkBufferDirty(lbuf);
 			}
-
-			UnlockReleaseBuffer(lbuf);
 		}
 	}
 
-	/* We no longer need the right buffer */
+	/* We no longer need the buffers */
+	if (BufferIsValid(lbuf))
+		UnlockReleaseBuffer(lbuf);
 	UnlockReleaseBuffer(rbuf);
 
 	/*
@@ -404,32 +403,39 @@ btree_xlog_split(bool onleft, bool isroot,
 	 * replay, because no other index update can be in progress, and readers
 	 * will cope properly when following an obsolete left-link.
 	 */
-	if (record->xl_info & XLR_BKP_BLOCK(1))
-		(void) RestoreBackupBlock(lsn, record, 1, false, false);
-	else if (xlrec->rnext != P_NONE)
+	if (xlrec->rnext != P_NONE)
 	{
-		Buffer		buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
+		/*
+		 * The backup block containing the right sibling is 1 on a
+		 * leaf-page split, or 2 on an internal-page split, because an
+		 * internal split also registers the child page (block 1).
+		 */
+		int		rnext_index = isleaf ? 1 : 2;
 
-		if (BufferIsValid(buffer))
+		if (record->xl_info & XLR_BKP_BLOCK(rnext_index))
+			(void) RestoreBackupBlock(lsn, record, rnext_index, false, false);
+		else
 		{
-			Page		page = (Page) BufferGetPage(buffer);
+			Buffer		buffer;
 
-			if (lsn > PageGetLSN(page))
+			buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
+
+			if (BufferIsValid(buffer))
 			{
-				BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+				Page		page = (Page) BufferGetPage(buffer);
 
-				pageop->btpo_prev = xlrec->rightsib;
+				if (lsn > PageGetLSN(page))
+				{
+					BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
-				PageSetLSN(page, lsn);
-				MarkBufferDirty(buffer);
+					pageop->btpo_prev = xlrec->rightsib;
+
+					PageSetLSN(page, lsn);
+					MarkBufferDirty(buffer);
+				}
+				UnlockReleaseBuffer(buffer);
 			}
-			UnlockReleaseBuffer(buffer);
 		}
 	}
-
-	/* The job ain't done till the parent link is inserted... */
-	log_incomplete_split(xlrec->node,
-						 xlrec->leftsib, xlrec->rightsib, isroot);
 }
 
 static void
@@ -992,10 +998,7 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque pageop;
-	BlockNumber downlink = 0;
-
-	/* Backup blocks are not used in newroot records */
-	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+	BlockNumber cblkno;
 
 	buffer = XLogReadBuffer(xlrec->node, xlrec->rootblk, true);
 	Assert(BufferIsValid(buffer));
@@ -1018,10 +1021,16 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
 		_bt_restore_page(page,
 						 (char *) xlrec + SizeOfBtreeNewroot,
 						 record->xl_len - SizeOfBtreeNewroot);
-		/* extract downlink to the right-hand split page */
-		itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY));
-		downlink = ItemPointerGetBlockNumber(&(itup->t_tid));
+		/* extract block number of the left-hand split page */
+		itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+		cblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
 		Assert(ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+
+		/* Clear the incomplete-split flag in left child */
+		if (record->xl_info & XLR_BKP_BLOCK(0))
+			(void) RestoreBackupBlock(lsn, record, 0, false, false);
+		else
+			_bt_clear_incomplete_split(lsn, record, xlrec->node, cblkno);
 	}
 
 	PageSetLSN(page, lsn);
@@ -1031,10 +1040,6 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
 	_bt_restore_meta(xlrec->node, lsn,
 					 xlrec->rootblk, xlrec->level,
 					 xlrec->rootblk, xlrec->level);
-
-	/* Check to see if this satisfies any incomplete insertions */
-	if (record->xl_len > SizeOfBtreeNewroot)
-		forget_matching_split(xlrec->node, downlink, true);
 }
 
 static void
@@ -1114,59 +1119,3 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
 }
-
-void
-btree_xlog_startup(void)
-{
-	incomplete_splits = NIL;
-}
-
-void
-btree_xlog_cleanup(void)
-{
-	ListCell   *l;
-
-	foreach(l, incomplete_splits)
-	{
-		/* finish an incomplete split */
-		bt_incomplete_split *action = (bt_incomplete_split *) lfirst(l);
-		Buffer		lbuf,
-					rbuf;
-		Page		lpage,
-					rpage;
-		BTPageOpaque lpageop,
-					rpageop;
-		bool		is_only;
-		Relation	reln;
-
-		lbuf = XLogReadBuffer(action->node, action->leftblk, false);
-		/* failure is impossible because we wrote this page earlier */
-		if (!BufferIsValid(lbuf))
-			elog(PANIC, "btree_xlog_cleanup: left block unfound");
-		lpage = (Page) BufferGetPage(lbuf);
-		lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
-		rbuf = XLogReadBuffer(action->node, action->rightblk, false);
-		/* failure is impossible because we wrote this page earlier */
-		if (!BufferIsValid(rbuf))
-			elog(PANIC, "btree_xlog_cleanup: right block unfound");
-		rpage = (Page) BufferGetPage(rbuf);
-		rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
-
-		/* if the pages are all of their level, it's a only-page split */
-		is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
-
-		reln = CreateFakeRelcacheEntry(action->node);
-		_bt_insert_parent(reln, lbuf, rbuf, NULL,
-						  action->is_root, is_only);
-		FreeFakeRelcacheEntry(reln);
-	}
-	incomplete_splits = NIL;
-}
-
-bool
-btree_safe_restartpoint(void)
-{
-	if (incomplete_splits)
-		return false;
-	return true;
-}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a1fb948..f974ea2 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -73,6 +73,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
 #define BTP_HALF_DEAD	(1 << 4)	/* empty, but still in tree */
 #define BTP_SPLIT_END	(1 << 5)	/* rightmost page of split group */
 #define BTP_HAS_GARBAGE (1 << 6)	/* page has LP_DEAD tuples */
+#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
 
 /*
  * The max allowed value of a cycle ID is a bit less than 64K.	This is
@@ -178,6 +179,7 @@ typedef struct BTMetaPageData
 #define P_ISHALFDEAD(opaque)	((opaque)->btpo_flags & BTP_HALF_DEAD)
 #define P_IGNORE(opaque)		((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
 #define P_HAS_GARBAGE(opaque)	((opaque)->btpo_flags & BTP_HAS_GARBAGE)
+#define P_INCOMPLETE_SPLIT(opaque)	((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT)
 
 /*
  *	Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -253,7 +255,7 @@ typedef struct xl_btree_metadata
 typedef struct xl_btree_insert
 {
 	xl_btreetid target;			/* inserted tuple id */
-	/* BlockNumber downlink field FOLLOWS IF NOT XLOG_BTREE_INSERT_LEAF */
+	/* BlockNumber finishes_split field FOLLOWS IF NOT XLOG_BTREE_INSERT_LEAF */
 	/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_INSERT_META */
 	/* INDEX TUPLE FOLLOWS AT END OF STRUCT */
 } xl_btree_insert;
@@ -286,19 +288,18 @@ typedef struct xl_btree_split
 	OffsetNumber firstright;	/* first item moved to right page */
 
 	/*
-	 * If level > 0, BlockIdData downlink follows.	(We use BlockIdData rather
-	 * than BlockNumber for alignment reasons: SizeOfBtreeSplit is only 16-bit
-	 * aligned.)
+	 * In the _L variants, next are OffsetNumber newitemoff and the new item.
+	 * (In the _R variants, the new item is one of the right page's tuples.)
+	 * The new item, but not newitemoff, is suppressed if XLogInsert chooses
+	 * to store the left page's whole page image.
 	 *
 	 * If level > 0, an IndexTuple representing the HIKEY of the left page
 	 * follows.  We don't need this on leaf pages, because it's the same as
 	 * the leftmost key in the new right page.	Also, it's suppressed if
 	 * XLogInsert chooses to store the left page's whole page image.
 	 *
-	 * In the _L variants, next are OffsetNumber newitemoff and the new item.
-	 * (In the _R variants, the new item is one of the right page's tuples.)
-	 * The new item, but not newitemoff, is suppressed if XLogInsert chooses
-	 * to store the left page's whole page image.
+	 * If level > 0, BlockNumber of the page whose incomplete-split flag
+	 * this insertion clears. (not aligned)
 	 *
 	 * Last are the right page's tuples in the form used by _bt_restore_page.
 	 */
@@ -642,8 +643,7 @@ extern Datum btoptions(PG_FUNCTION_ARGS);
 extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
-extern void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
-				  BTStack stack, bool is_root, bool is_only);
+extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 
 /*
  * prototypes for functions in nbtpage.c
@@ -673,7 +673,8 @@ extern BTStack _bt_search(Relation rel,
 		   int keysz, ScanKey scankey, bool nextkey,
 		   Buffer *bufP, int access);
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, int access);
+			  ScanKey scankey, bool nextkey,
+			  bool forupdate, BTStack stack, int access);
 extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
 			ScanKey scankey, bool nextkey);
 extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
@@ -722,8 +723,5 @@ extern void _bt_leafbuild(BTSpool *btspool, BTSpool *spool2);
  */
 extern void btree_redo(XLogRecPtr lsn, XLogRecord *record);
 extern void btree_desc(StringInfo buf, uint8 xl_info, char *rec);
-extern void btree_xlog_startup(void);
-extern void btree_xlog_cleanup(void);
-extern bool btree_safe_restartpoint(void);
 
 #endif   /* NBTREE_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c968101..d9ee57d 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, NULL, NULL, NULL)
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, NULL, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, NULL, NULL, NULL)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, NULL, NULL, NULL)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup, NULL)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, NULL)
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)