Hello, hackers.
------ ABSTRACT ------
There is a race condition between btree_xlog_unlink_page and _bt_walk_left.
A lot of versions are affected including 12 and new-coming 13.
Happens only on standby. Seems like could not cause invalid query results.
------ REMARK ------
While working on support for index hint bits on standby [1] I have
started to getting
"ERROR: could not find left sibling of block XXXX in index XXXX"
during stress tests.
I was sure I have broken something in btree and spent a lot of time
trying to figure what.
And later... I realized what it is bug in btree since a very old times...
Because of much faster scans with LP_DEAD support on a standby it
happens much more frequently in my case.
------ HOW TO REPRODUCE ------
It is not easy to reproduce the issue but you can try (tested on
REL_12_STABLE and master):
1) Setup master (sync replica and 'remote_apply' are not required -
just make scripts simpler):
autovacuum = off
synchronous_standby_names = '*'
synchronous_commit = 'remote_apply'
2) Setup standby:
primary_conninfo = 'user=postgres host=127.0.0.1 port=5432
sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres
target_session_attrs=any'
port = 6543
3) Prepare pgbench file with content (test.bench):
BEGIN;
select * from pgbench_accounts order by aid desc limit 1;
END;
4) Prepare the index:
./pgbench -i -s 10 -U postgres
./psql -U postgres -c "delete from pgbench_accounts where aid IN
(select aid from pgbench_accounts order by aid desc limit 500000)"
5) Start index scans on the standby:
./pgbench -f test.bench -j 1 -c ${NUMBER_OF_CORES} -n -P 1 -T 10000 -U
postgres -p 6543
6) Run vacuum on the master:
./psql -U postgres -c "vacuum pgbench_accounts"
7) You should see something like this:
> progress: 1.0 s, 5.0 tps, lat 614.530 ms stddev 95.902
> .....
> progress: 5.0 s, 10.0 tps, lat 508.561 ms stddev 82.338
> client 3 script 0 aborted in command 1 query 0: ERROR: could not find left
> sibling of block 1451 in index "pgbench_accounts_pkey"
> progress: 6.0 s, 47.0 tps, lat 113.001 ms stddev 55.709
> .....
> progress: 12.0 s, 84.0 tps, lat 48.451 ms stddev 7.238
> client 2 script 0 aborted in command 1 query 0: ERROR: could not find left
> sibling of block 2104 in index "pgbench_accounts_pkey"
> progress: 13.0 s, 87.0 tps, lat 39.338 ms stddev 5.417
> .....
> progress: 16.0 s, 158.0 tps, lat 18.988 ms stddev 3.251
> client 4 script 0 aborted in command 1 query 0: ERROR: could not find left
> sibling of block 2501 in index "pgbench_accounts_pkey"
I was able to reproduce issue with vanilla PG_12 on different systems
including the Windows machine.
On some servers it happens 100% times. On others - very seldom.
It is possible to radically increase chance to reproduce the issue by
adding a sleep in btree_xlog_unlink_page[7]:
> + pg_usleep(10 * 1000L);
>
> /* Rewrite target page as empty deleted page */
> buffer = XLogInitBufferForRedo(record, 0);
------ WHAT HAPPENS ------
It is race condition issue between btree_xlog_unlink_page and _bt_walk_left.
btree_xlog_unlink_page removes page from btree changing btpo_prev and
btpo_next of its left and right siblings to point
to the each other and marks target page as removed. All these
operations are done using one-page-at-a-time locking because of[4]:
> * In normal operation, we would lock all the pages this WAL record
> * touches before changing any of them. In WAL replay, it should be okay
> * to lock just one page at a time, since no concurrent index updates can
> * be happening, and readers should not care whether they arrive at the
> * target page or not (since it's surely empty).
_bt_walk_left walks left in very tricky way. Please refer to
src/backend/access/nbtree/README for details[5]:
> Moving left in a backward scan is complicated because we must consider
> the possibility that the left sibling was just split (meaning we must find
> the rightmost page derived from the left sibling), plus the possibility
> that the page we were just on has now been deleted and hence isn't in the
> sibling chain at all anymore.
So, this is how race is happens:
0) this is target page (B) and its siblings.
A <---> B <---> C ---> END
1) walreceiver starts btree_xlog_unlink_page for the B. It is changes
the links from C to A and from A to C (I hope my scheme will be
displayed correctly):
A <---- B ----> C ---> END
^ ^
\_____________/
But B is not marked as BTP_DELETED yet - walreceiver stops at nbtxlog:697[2].
2) other backend starts _bt_walk_left from B.
It checks A, goes to from A to C by updated btpo_next and later sees
end of the btree.
So, next step is to check if B was deleted (nbtsearch:2011)[3] and try
to recover.
But B is not yet deleted! It will be marked as BTP_DELETED after a few
millis by walreceiver but not yet.
So, nbtsearch:2046[6] is happens.
------ HOW TO FIX ------
The first idea was to mark page as BTP_DELETED before updating siblings links.
Second - to update pages in the following order:
* change btpo_next
* mark as BTP_DELETED
* change btpo_prev
Such a changes fix the exactly described above race condition... but
cause a more tricky ones to start happening.
And I think it is better to avoid any too complex unclear solutions here..
Another idea - to sleep a little waiting walreceiver to mark the page
as deleted. But it seems to feel too ugly. Also it is unclear how long
to wait.
So, I think right way is to lock all three pages as it is done on the
primary. As far as I can see it is not causes any real performance
regression.
Patch is attached (on top of REL_12_STABLE).
Thanks,
Michail.
[1]:
https://www.postgresql.org/message-id/flat/CANtu0ohOvgteBYmCMc2KERFiJUvpWGB0bRTbK_WseQH-L1jkrQ%40mail.gmail.com
[2]:
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L697
[3]:
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtsearch.c#L2011
[4]:
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L575-L581
[5]:
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/README#L289-L314
[6]:
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtsearch.c#L2046
[7]:
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L696
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..4e302fdd69 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -649,7 +649,7 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
BlockNumber leftsib;
BlockNumber rightsib;
- Buffer buffer;
+ Buffer leafbuf, lbuff = InvalidBuffer, rbuff, buff;
Page page;
BTPageOpaque pageop;
@@ -657,47 +657,29 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
rightsib = xlrec->rightsib;
/*
- * In normal operation, we would lock all the pages this WAL record
- * touches before changing any of them. In WAL replay, it should be okay
- * to lock just one page at a time, since no concurrent index updates can
- * be happening, and readers should not care whether they arrive at the
- * target page or not (since it's surely empty).
+ * We have to lock the pages we need to modify in the moving right order.
+ * Else we will go into the race against _bt_walk_left.
*/
- /* Fix left-link of right sibling */
- if (XLogReadBufferForRedo(record, 2, &buffer) == BLK_NEEDS_REDO)
- {
- page = (Page) BufferGetPage(buffer);
- pageop = (BTPageOpaque) PageGetSpecialPointer(page);
- pageop->btpo_prev = leftsib;
-
- PageSetLSN(page, lsn);
- MarkBufferDirty(buffer);
- }
- if (BufferIsValid(buffer))
- UnlockReleaseBuffer(buffer);
-
/* Fix right-link of left sibling, if any */
if (leftsib != P_NONE)
{
- if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+ if (XLogReadBufferForRedo(record, 1, &lbuff) == BLK_NEEDS_REDO)
{
- page = (Page) BufferGetPage(buffer);
+ page = (Page) BufferGetPage(lbuff);
pageop = (BTPageOpaque) PageGetSpecialPointer(page);
pageop->btpo_next = rightsib;
PageSetLSN(page, lsn);
- MarkBufferDirty(buffer);
+ MarkBufferDirty(lbuff);
}
- if (BufferIsValid(buffer))
- UnlockReleaseBuffer(buffer);
}
/* Rewrite target page as empty deleted page */
- buffer = XLogInitBufferForRedo(record, 0);
- page = (Page) BufferGetPage(buffer);
+ buff = XLogInitBufferForRedo(record, 0);
+ page = (Page) BufferGetPage(buff);
- _bt_pageinit(page, BufferGetPageSize(buffer));
+ _bt_pageinit(page, BufferGetPageSize(buff));
pageop = (BTPageOpaque) PageGetSpecialPointer(page);
pageop->btpo_prev = leftsib;
@@ -707,9 +689,26 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
pageop->btpo_cycleid = 0;
PageSetLSN(page, lsn);
- MarkBufferDirty(buffer);
- UnlockReleaseBuffer(buffer);
+ MarkBufferDirty(buff);
+ /* Fix left-link of right sibling */
+ if (XLogReadBufferForRedo(record, 2, &rbuff) == BLK_NEEDS_REDO)
+ {
+ page = (Page) BufferGetPage(rbuff);
+ pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ pageop->btpo_prev = leftsib;
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(rbuff);
+ }
+
+ /* Release all buffers */
+ if (BufferIsValid(lbuff))
+ UnlockReleaseBuffer(lbuff);
+ UnlockReleaseBuffer(buff);
+ if (BufferIsValid(rbuff))
+ UnlockReleaseBuffer(rbuff);
+
/*
* If we deleted a parent of the targeted leaf page, instead of the leaf
* itself, update the leaf to point to the next remaining child in the
@@ -723,10 +722,10 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
*/
IndexTupleData trunctuple;
- buffer = XLogInitBufferForRedo(record, 3);
- page = (Page) BufferGetPage(buffer);
+ leafbuf = XLogInitBufferForRedo(record, 3);
+ page = (Page) BufferGetPage(leafbuf);
- _bt_pageinit(page, BufferGetPageSize(buffer));
+ _bt_pageinit(page, BufferGetPageSize(leafbuf));
pageop = (BTPageOpaque) PageGetSpecialPointer(page);
pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
@@ -745,8 +744,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
elog(ERROR, "could not add dummy high key to half-dead page");
PageSetLSN(page, lsn);
- MarkBufferDirty(buffer);
- UnlockReleaseBuffer(buffer);
+ MarkBufferDirty(leafbuf);
+ UnlockReleaseBuffer(leafbuf);
}
/* Update metapage if needed */