While looking at Alexander's GIN patch, I noticed an ancient bug in the WAL-logging of GIN entry-tree insertions. entryPlaceToPage and dataPlacetoPage functions don't make a full-page image of the page, when inserting a downlink on a non-leaf page. The comment says:

        /*
         * Prevent full page write if child's split occurs. That is needed to
         * remove incomplete splits while replaying WAL
         *
         * data.updateBlkno contains new block number (of newly created right
         * page) for recently splited page.
         */


The code is doing what the comment says, but that's wrong. You can't just skip the full page write, it's needed for torn page protection like in any other case. The correct fix would've been to change the redo-routine to do the incomplete split tracking for the page even if it's restored from a full page image.

This was broken by this commit back in 2007:

commit 853d1c3103fa961ae6219f0281885b345593d101
Author: Teodor Sigaev <teo...@sigaev.ru>
Date:   Mon Jun 4 15:56:28 2007 +0000

    Fix bundle bugs of GIN:
    - Fix possible deadlock between UPDATE and VACUUM queries. Bug never was
      observed in 8.2, but it still exist there. HEAD is more sensitive to
      bug after recent "ring" of buffer improvements.
    - Fix WAL creation: if parent page is stored as is after split then
      incomplete split isn't removed during replay. This happens rather rare, 
only
      on large tables with a lot of updates/inserts.
    - Fix WAL replay: there was wrong test of XLR_BKP_BLOCK_* for left
      page after deletion of page. That causes wrong rightlink field: it pointed
      to deleted page.
    - add checking of match of clearing incomplete split
    - cleanup incomplete split list after proceeding

    All of this chages doesn't change on-disk storage, so backpatch...
    But second point may be an issue for replaying logs from previous version.

The relevant part is the "Fix WAL creation" item. I searched the archives but couldn't find any discussion leading to this fix.

In 2010, Tom actually fixed the redo-routine in commit 4016bdef8aded77b4903c457050622a5a1815c16, along with other fixes. So all we need to do now is to fix that bogus logic in entryPlaceToPage to not skip the full-page-image.

Attached is a patch for 9.3. I've whacked the code a lot in master, but the same bug is present there too.

- Heikki
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 13ab448..32eba4c 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -381,7 +381,6 @@ dataPlaceToPage(GinBtree btree, Buffer buf, OffsetNumber off, XLogRecData **prda
 {
 	Page		page = BufferGetPage(buf);
 	int			sizeofitem = GinSizeOfDataPageItem(page);
-	int			cnt = 0;
 
 	/* these must be static so they can be returned to caller */
 	static XLogRecData rdata[3];
@@ -401,32 +400,25 @@ dataPlaceToPage(GinBtree btree, Buffer buf, OffsetNumber off, XLogRecData **prda
 	data.isLeaf = GinPageIsLeaf(page) ? TRUE : FALSE;
 
 	/*
-	 * Prevent full page write if child's split occurs. That is needed to
-	 * remove incomplete splits while replaying WAL
-	 *
-	 * data.updateBlkno contains new block number (of newly created right
-	 * page) for recently splited page.
+	 * For incomplete-split tracking, we need the updateBlkno information and
+	 * the inserted item even when we make a full page image of the page, so
+	 * put the buffer reference in a separate XLogRecData entry.
 	 */
-	if (data.updateBlkno == InvalidBlockNumber)
-	{
-		rdata[0].buffer = buf;
-		rdata[0].buffer_std = FALSE;
-		rdata[0].data = NULL;
-		rdata[0].len = 0;
-		rdata[0].next = &rdata[1];
-		cnt++;
-	}
+	rdata[0].buffer = buf;
+	rdata[0].buffer_std = FALSE;
+	rdata[0].data = NULL;
+	rdata[0].len = 0;
+	rdata[0].next = &rdata[1];
 
-	rdata[cnt].buffer = InvalidBuffer;
-	rdata[cnt].data = (char *) &data;
-	rdata[cnt].len = sizeof(ginxlogInsert);
-	rdata[cnt].next = &rdata[cnt + 1];
-	cnt++;
+	rdata[1].buffer = InvalidBuffer;
+	rdata[1].data = (char *) &data;
+	rdata[1].len = sizeof(ginxlogInsert);
+	rdata[1].next = &rdata[2];
 
-	rdata[cnt].buffer = InvalidBuffer;
-	rdata[cnt].data = (GinPageIsLeaf(page)) ? ((char *) (btree->items + btree->curitem)) : ((char *) &(btree->pitem));
-	rdata[cnt].len = sizeofitem;
-	rdata[cnt].next = NULL;
+	rdata[2].buffer = InvalidBuffer;
+	rdata[2].data = (GinPageIsLeaf(page)) ? ((char *) (btree->items + btree->curitem)) : ((char *) &(btree->pitem));
+	rdata[2].len = sizeofitem;
+	rdata[2].next = NULL;
 
 	if (GinPageIsLeaf(page))
 	{
@@ -442,7 +434,7 @@ dataPlaceToPage(GinBtree btree, Buffer buf, OffsetNumber off, XLogRecData **prda
 				btree->curitem++;
 			}
 			data.nitem = btree->curitem - savedPos;
-			rdata[cnt].len = sizeofitem * data.nitem;
+			rdata[2].len = sizeofitem * data.nitem;
 		}
 		else
 		{
diff --git a/src/backend/access/gin/ginentrypage.c b/src/backend/access/gin/ginentrypage.c
index 8ead38f..f6c0632 100644
--- a/src/backend/access/gin/ginentrypage.c
+++ b/src/backend/access/gin/ginentrypage.c
@@ -486,7 +486,6 @@ entryPlaceToPage(GinBtree btree, Buffer buf, OffsetNumber off, XLogRecData **prd
 {
 	Page		page = BufferGetPage(buf);
 	OffsetNumber placed;
-	int			cnt = 0;
 
 	/* these must be static so they can be returned to caller */
 	static XLogRecData rdata[3];
@@ -509,32 +508,25 @@ entryPlaceToPage(GinBtree btree, Buffer buf, OffsetNumber off, XLogRecData **prd
 	data.isLeaf = GinPageIsLeaf(page) ? TRUE : FALSE;
 
 	/*
-	 * Prevent full page write if child's split occurs. That is needed to
-	 * remove incomplete splits while replaying WAL
-	 *
-	 * data.updateBlkno contains new block number (of newly created right
-	 * page) for recently splited page.
+	 * For incomplete-split tracking, we need the updateBlkno information and
+	 * the inserted item even when we make a full page image of the page, so
+	 * put the buffer reference in a separate XLogRecData entry.
 	 */
-	if (data.updateBlkno == InvalidBlockNumber)
-	{
-		rdata[0].buffer = buf;
-		rdata[0].buffer_std = TRUE;
-		rdata[0].data = NULL;
-		rdata[0].len = 0;
-		rdata[0].next = &rdata[1];
-		cnt++;
-	}
-
-	rdata[cnt].buffer = InvalidBuffer;
-	rdata[cnt].data = (char *) &data;
-	rdata[cnt].len = sizeof(ginxlogInsert);
-	rdata[cnt].next = &rdata[cnt + 1];
-	cnt++;
+	rdata[0].buffer = buf;
+	rdata[0].buffer_std = TRUE;
+	rdata[0].data = NULL;
+	rdata[0].len = 0;
+	rdata[0].next = &rdata[1];
 
-	rdata[cnt].buffer = InvalidBuffer;
-	rdata[cnt].data = (char *) btree->entry;
-	rdata[cnt].len = IndexTupleSize(btree->entry);
-	rdata[cnt].next = NULL;
+	rdata[1].buffer = InvalidBuffer;
+	rdata[1].data = (char *) &data;
+	rdata[1].len = sizeof(ginxlogInsert);
+	rdata[1].next = &rdata[2];
+
+	rdata[2].buffer = InvalidBuffer;
+	rdata[2].data = (char *) btree->entry;
+	rdata[2].len = IndexTupleSize(btree->entry);
+	rdata[2].next = NULL;
 
 	btree->entry = NULL;
 }
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to