Hello, hackers.

------ ABSTRACT ------
There is a race condition between btree_xlog_unlink_page and _bt_walk_left.
A lot of versions are affected including 12 and new-coming 13.
Happens only on standby. Seems like could not cause invalid query results.

------ REMARK ------
While working on support for index hint bits on standby [1] I have
started to getting
"ERROR:  could not find left sibling of block XXXX in index XXXX"
during stress tests.

I was sure I have broken something in btree and spent a lot of time
trying to figure what.
And later...  I realized what it is bug in btree since a very old times...
Because of much faster scans with LP_DEAD support on a standby it
happens much more frequently in my case.

------ HOW TO REPRODUCE  ------
It is not easy to reproduce the issue but you can try (tested on
REL_12_STABLE and master):

1) Setup master (sync replica and 'remote_apply' are not required -
just make scripts simpler):
autovacuum = off
synchronous_standby_names = '*'
synchronous_commit = 'remote_apply'

2) Setup standby:
primary_conninfo = 'user=postgres host=127.0.0.1 port=5432
sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres
target_session_attrs=any'
port = 6543

3) Prepare pgbench file with content (test.bench):
BEGIN;
select * from pgbench_accounts order by aid desc limit 1;
END;

4) Prepare the index:
./pgbench -i -s 10 -U postgres
./psql -U postgres -c "delete from pgbench_accounts where aid IN
(select aid from pgbench_accounts order by aid desc limit 500000)"

5) Start index scans on the standby:
./pgbench -f test.bench -j 1 -c ${NUMBER_OF_CORES} -n -P 1 -T 10000 -U
postgres -p 6543

6) Run vacuum on the master:
./psql -U postgres -c "vacuum pgbench_accounts"

7) You should see something like this:
> progress: 1.0 s, 5.0 tps, lat 614.530 ms stddev 95.902
> .....
> progress: 5.0 s, 10.0 tps, lat 508.561 ms stddev 82.338
> client 3 script 0 aborted in command 1 query 0: ERROR:  could not find left 
> sibling of block 1451 in index "pgbench_accounts_pkey"
> progress: 6.0 s, 47.0 tps, lat 113.001 ms stddev 55.709
> .....
> progress: 12.0 s, 84.0 tps, lat 48.451 ms stddev 7.238
> client 2 script 0 aborted in command 1 query 0: ERROR:  could not find left 
> sibling of block 2104 in index "pgbench_accounts_pkey"
> progress: 13.0 s, 87.0 tps, lat 39.338 ms stddev 5.417
> .....
> progress: 16.0 s, 158.0 tps, lat 18.988 ms stddev 3.251
> client 4 script 0 aborted in command 1 query 0: ERROR:  could not find left 
> sibling of block 2501 in index "pgbench_accounts_pkey"

I was able to reproduce issue with vanilla PG_12 on different systems
including the Windows machine.
On some servers it happens 100% times. On others - very seldom.

It is possible to radically increase chance to reproduce the issue by
adding a sleep in btree_xlog_unlink_page[7]:
>   + pg_usleep(10 * 1000L);
>
>   /* Rewrite target page as empty deleted page */
>   buffer = XLogInitBufferForRedo(record, 0);

------  WHAT HAPPENS  ------
It is race condition issue between btree_xlog_unlink_page and _bt_walk_left.

btree_xlog_unlink_page removes page from btree changing btpo_prev and
btpo_next of its left and right siblings to point
to the each other and marks target page as removed. All these
operations are done using one-page-at-a-time locking because of[4]:

>    * In normal operation, we would lock all the pages this WAL record
>    * touches before changing any of them.  In WAL replay, it should be okay
>    * to lock just one page at a time, since no concurrent index updates can
>    * be happening, and readers should not care whether they arrive at the
>    * target page or not (since it's surely empty).

_bt_walk_left walks left in very tricky way. Please refer to
src/backend/access/nbtree/README for details[5]:

>    Moving left in a backward scan is complicated because we must consider
>    the possibility that the left sibling was just split (meaning we must find
>    the rightmost page derived from the left sibling), plus the possibility
>    that the page we were just on has now been deleted and hence isn't in the
>    sibling chain at all anymore.

So, this is how race is happens:

0) this is target page (B) and its siblings.
A <---> B <---> C ---> END

1) walreceiver starts btree_xlog_unlink_page for the B. It is changes
the links from C to A and from A to C (I hope my scheme will be
displayed correctly):
A <---- B ----> C ---> END
^               ^
 \_____________/

But B is not marked as BTP_DELETED yet - walreceiver stops at nbtxlog:697[2].

2) other backend starts _bt_walk_left from B.
It checks A, goes to from A to C by updated btpo_next and later sees
end of the btree.
So, next step is to check if B was deleted (nbtsearch:2011)[3] and try
to recover.

But B is not yet deleted! It will be marked as BTP_DELETED after a few
millis by walreceiver but not yet.
So, nbtsearch:2046[6] is happens.


------  HOW TO FIX  ------
The first idea was to mark page as BTP_DELETED before updating siblings links.

Second - to update pages in the following order:
* change btpo_next
* mark as BTP_DELETED
* change btpo_prev

Such a changes fix the exactly described above race condition... but
cause a more tricky ones to start happening.
And I think it is better to avoid any too complex unclear solutions here..

Another idea - to sleep a little waiting walreceiver to mark the page
as deleted. But it seems to feel too ugly. Also it is unclear how long
to wait.

So, I think right way is to lock all three pages as it is done on the
primary. As far as I can see it is not causes any real performance
regression.

Patch is attached (on top of REL_12_STABLE).

Thanks,
Michail.

[1]: 
https://www.postgresql.org/message-id/flat/CANtu0ohOvgteBYmCMc2KERFiJUvpWGB0bRTbK_WseQH-L1jkrQ%40mail.gmail.com
[2]: 
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L697
[3]: 
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtsearch.c#L2011
[4]: 
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L575-L581
[5]: 
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/README#L289-L314
[6]: 
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtsearch.c#L2046
[7]: 
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L696
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..4e302fdd69 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -649,7 +649,7 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
 	BlockNumber leftsib;
 	BlockNumber rightsib;
-	Buffer		buffer;
+	Buffer		leafbuf, lbuff = InvalidBuffer, rbuff, buff;
 	Page		page;
 	BTPageOpaque pageop;
 
@@ -657,47 +657,29 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	rightsib = xlrec->rightsib;
 
 	/*
-	 * In normal operation, we would lock all the pages this WAL record
-	 * touches before changing any of them.  In WAL replay, it should be okay
-	 * to lock just one page at a time, since no concurrent index updates can
-	 * be happening, and readers should not care whether they arrive at the
-	 * target page or not (since it's surely empty).
+	 * We have to lock the pages we need to modify in the moving right order.
+	 * Else we will go into the race against _bt_walk_left.
 	 */
 
-	/* Fix left-link of right sibling */
-	if (XLogReadBufferForRedo(record, 2, &buffer) == BLK_NEEDS_REDO)
-	{
-		page = (Page) BufferGetPage(buffer);
-		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
-		pageop->btpo_prev = leftsib;
-
-		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-	}
-	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
-
 	/* Fix right-link of left sibling, if any */
 	if (leftsib != P_NONE)
 	{
-		if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+		if (XLogReadBufferForRedo(record, 1, &lbuff) == BLK_NEEDS_REDO)
 		{
-			page = (Page) BufferGetPage(buffer);
+			page = (Page) BufferGetPage(lbuff);
 			pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 			pageop->btpo_next = rightsib;
 
 			PageSetLSN(page, lsn);
-			MarkBufferDirty(buffer);
+			MarkBufferDirty(lbuff);
 		}
-		if (BufferIsValid(buffer))
-			UnlockReleaseBuffer(buffer);
 	}
 
 	/* Rewrite target page as empty deleted page */
-	buffer = XLogInitBufferForRedo(record, 0);
-	page = (Page) BufferGetPage(buffer);
+	buff = XLogInitBufferForRedo(record, 0);
+	page = (Page) BufferGetPage(buff);
 
-	_bt_pageinit(page, BufferGetPageSize(buffer));
+	_bt_pageinit(page, BufferGetPageSize(buff));
 	pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
 	pageop->btpo_prev = leftsib;
@@ -707,9 +689,26 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	pageop->btpo_cycleid = 0;
 
 	PageSetLSN(page, lsn);
-	MarkBufferDirty(buffer);
-	UnlockReleaseBuffer(buffer);
+	MarkBufferDirty(buff);
 
+	/* Fix left-link of right sibling */
+	if (XLogReadBufferForRedo(record, 2, &rbuff) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(rbuff);
+		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+		pageop->btpo_prev = leftsib;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(rbuff);
+	}
+
+	/* Release all buffers */
+	if (BufferIsValid(lbuff))
+		UnlockReleaseBuffer(lbuff);
+	UnlockReleaseBuffer(buff);
+	if (BufferIsValid(rbuff))
+		UnlockReleaseBuffer(rbuff);
+
 	/*
 	 * If we deleted a parent of the targeted leaf page, instead of the leaf
 	 * itself, update the leaf to point to the next remaining child in the
@@ -723,10 +722,10 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 		 */
 		IndexTupleData trunctuple;
 
-		buffer = XLogInitBufferForRedo(record, 3);
-		page = (Page) BufferGetPage(buffer);
+		leafbuf = XLogInitBufferForRedo(record, 3);
+		page = (Page) BufferGetPage(leafbuf);
 
-		_bt_pageinit(page, BufferGetPageSize(buffer));
+		_bt_pageinit(page, BufferGetPageSize(leafbuf));
 		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
 		pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
@@ -745,8 +744,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 			elog(ERROR, "could not add dummy high key to half-dead page");
 
 		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-		UnlockReleaseBuffer(buffer);
+		MarkBufferDirty(leafbuf);
+		UnlockReleaseBuffer(leafbuf);
 	}
 
 	/* Update metapage if needed */

Reply via email to