Hello, hackers. ------ ABSTRACT ------ There is a race condition between btree_xlog_unlink_page and _bt_walk_left. A lot of versions are affected including 12 and new-coming 13. Happens only on standby. Seems like could not cause invalid query results.
------ REMARK ------ While working on support for index hint bits on standby [1] I have started to getting "ERROR: could not find left sibling of block XXXX in index XXXX" during stress tests. I was sure I have broken something in btree and spent a lot of time trying to figure what. And later... I realized what it is bug in btree since a very old times... Because of much faster scans with LP_DEAD support on a standby it happens much more frequently in my case. ------ HOW TO REPRODUCE ------ It is not easy to reproduce the issue but you can try (tested on REL_12_STABLE and master): 1) Setup master (sync replica and 'remote_apply' are not required - just make scripts simpler): autovacuum = off synchronous_standby_names = '*' synchronous_commit = 'remote_apply' 2) Setup standby: primary_conninfo = 'user=postgres host=127.0.0.1 port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any' port = 6543 3) Prepare pgbench file with content (test.bench): BEGIN; select * from pgbench_accounts order by aid desc limit 1; END; 4) Prepare the index: ./pgbench -i -s 10 -U postgres ./psql -U postgres -c "delete from pgbench_accounts where aid IN (select aid from pgbench_accounts order by aid desc limit 500000)" 5) Start index scans on the standby: ./pgbench -f test.bench -j 1 -c ${NUMBER_OF_CORES} -n -P 1 -T 10000 -U postgres -p 6543 6) Run vacuum on the master: ./psql -U postgres -c "vacuum pgbench_accounts" 7) You should see something like this: > progress: 1.0 s, 5.0 tps, lat 614.530 ms stddev 95.902 > ..... > progress: 5.0 s, 10.0 tps, lat 508.561 ms stddev 82.338 > client 3 script 0 aborted in command 1 query 0: ERROR: could not find left > sibling of block 1451 in index "pgbench_accounts_pkey" > progress: 6.0 s, 47.0 tps, lat 113.001 ms stddev 55.709 > ..... > progress: 12.0 s, 84.0 tps, lat 48.451 ms stddev 7.238 > client 2 script 0 aborted in command 1 query 0: ERROR: could not find left > sibling of block 2104 in index "pgbench_accounts_pkey" > progress: 13.0 s, 87.0 tps, lat 39.338 ms stddev 5.417 > ..... > progress: 16.0 s, 158.0 tps, lat 18.988 ms stddev 3.251 > client 4 script 0 aborted in command 1 query 0: ERROR: could not find left > sibling of block 2501 in index "pgbench_accounts_pkey" I was able to reproduce issue with vanilla PG_12 on different systems including the Windows machine. On some servers it happens 100% times. On others - very seldom. It is possible to radically increase chance to reproduce the issue by adding a sleep in btree_xlog_unlink_page[7]: > + pg_usleep(10 * 1000L); > > /* Rewrite target page as empty deleted page */ > buffer = XLogInitBufferForRedo(record, 0); ------ WHAT HAPPENS ------ It is race condition issue between btree_xlog_unlink_page and _bt_walk_left. btree_xlog_unlink_page removes page from btree changing btpo_prev and btpo_next of its left and right siblings to point to the each other and marks target page as removed. All these operations are done using one-page-at-a-time locking because of[4]: > * In normal operation, we would lock all the pages this WAL record > * touches before changing any of them. In WAL replay, it should be okay > * to lock just one page at a time, since no concurrent index updates can > * be happening, and readers should not care whether they arrive at the > * target page or not (since it's surely empty). _bt_walk_left walks left in very tricky way. Please refer to src/backend/access/nbtree/README for details[5]: > Moving left in a backward scan is complicated because we must consider > the possibility that the left sibling was just split (meaning we must find > the rightmost page derived from the left sibling), plus the possibility > that the page we were just on has now been deleted and hence isn't in the > sibling chain at all anymore. So, this is how race is happens: 0) this is target page (B) and its siblings. A <---> B <---> C ---> END 1) walreceiver starts btree_xlog_unlink_page for the B. It is changes the links from C to A and from A to C (I hope my scheme will be displayed correctly): A <---- B ----> C ---> END ^ ^ \_____________/ But B is not marked as BTP_DELETED yet - walreceiver stops at nbtxlog:697[2]. 2) other backend starts _bt_walk_left from B. It checks A, goes to from A to C by updated btpo_next and later sees end of the btree. So, next step is to check if B was deleted (nbtsearch:2011)[3] and try to recover. But B is not yet deleted! It will be marked as BTP_DELETED after a few millis by walreceiver but not yet. So, nbtsearch:2046[6] is happens. ------ HOW TO FIX ------ The first idea was to mark page as BTP_DELETED before updating siblings links. Second - to update pages in the following order: * change btpo_next * mark as BTP_DELETED * change btpo_prev Such a changes fix the exactly described above race condition... but cause a more tricky ones to start happening. And I think it is better to avoid any too complex unclear solutions here.. Another idea - to sleep a little waiting walreceiver to mark the page as deleted. But it seems to feel too ugly. Also it is unclear how long to wait. So, I think right way is to lock all three pages as it is done on the primary. As far as I can see it is not causes any real performance regression. Patch is attached (on top of REL_12_STABLE). Thanks, Michail. [1]: https://www.postgresql.org/message-id/flat/CANtu0ohOvgteBYmCMc2KERFiJUvpWGB0bRTbK_WseQH-L1jkrQ%40mail.gmail.com [2]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L697 [3]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtsearch.c#L2011 [4]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L575-L581 [5]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/README#L289-L314 [6]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtsearch.c#L2046 [7]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L696
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index dd5315c1aa..4e302fdd69 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -649,7 +649,7 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record) xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record); BlockNumber leftsib; BlockNumber rightsib; - Buffer buffer; + Buffer leafbuf, lbuff = InvalidBuffer, rbuff, buff; Page page; BTPageOpaque pageop; @@ -657,47 +657,29 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record) rightsib = xlrec->rightsib; /* - * In normal operation, we would lock all the pages this WAL record - * touches before changing any of them. In WAL replay, it should be okay - * to lock just one page at a time, since no concurrent index updates can - * be happening, and readers should not care whether they arrive at the - * target page or not (since it's surely empty). + * We have to lock the pages we need to modify in the moving right order. + * Else we will go into the race against _bt_walk_left. */ - /* Fix left-link of right sibling */ - if (XLogReadBufferForRedo(record, 2, &buffer) == BLK_NEEDS_REDO) - { - page = (Page) BufferGetPage(buffer); - pageop = (BTPageOpaque) PageGetSpecialPointer(page); - pageop->btpo_prev = leftsib; - - PageSetLSN(page, lsn); - MarkBufferDirty(buffer); - } - if (BufferIsValid(buffer)) - UnlockReleaseBuffer(buffer); - /* Fix right-link of left sibling, if any */ if (leftsib != P_NONE) { - if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO) + if (XLogReadBufferForRedo(record, 1, &lbuff) == BLK_NEEDS_REDO) { - page = (Page) BufferGetPage(buffer); + page = (Page) BufferGetPage(lbuff); pageop = (BTPageOpaque) PageGetSpecialPointer(page); pageop->btpo_next = rightsib; PageSetLSN(page, lsn); - MarkBufferDirty(buffer); + MarkBufferDirty(lbuff); } - if (BufferIsValid(buffer)) - UnlockReleaseBuffer(buffer); } /* Rewrite target page as empty deleted page */ - buffer = XLogInitBufferForRedo(record, 0); - page = (Page) BufferGetPage(buffer); + buff = XLogInitBufferForRedo(record, 0); + page = (Page) BufferGetPage(buff); - _bt_pageinit(page, BufferGetPageSize(buffer)); + _bt_pageinit(page, BufferGetPageSize(buff)); pageop = (BTPageOpaque) PageGetSpecialPointer(page); pageop->btpo_prev = leftsib; @@ -707,9 +689,26 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record) pageop->btpo_cycleid = 0; PageSetLSN(page, lsn); - MarkBufferDirty(buffer); - UnlockReleaseBuffer(buffer); + MarkBufferDirty(buff); + /* Fix left-link of right sibling */ + if (XLogReadBufferForRedo(record, 2, &rbuff) == BLK_NEEDS_REDO) + { + page = (Page) BufferGetPage(rbuff); + pageop = (BTPageOpaque) PageGetSpecialPointer(page); + pageop->btpo_prev = leftsib; + + PageSetLSN(page, lsn); + MarkBufferDirty(rbuff); + } + + /* Release all buffers */ + if (BufferIsValid(lbuff)) + UnlockReleaseBuffer(lbuff); + UnlockReleaseBuffer(buff); + if (BufferIsValid(rbuff)) + UnlockReleaseBuffer(rbuff); + /* * If we deleted a parent of the targeted leaf page, instead of the leaf * itself, update the leaf to point to the next remaining child in the @@ -723,10 +722,10 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record) */ IndexTupleData trunctuple; - buffer = XLogInitBufferForRedo(record, 3); - page = (Page) BufferGetPage(buffer); + leafbuf = XLogInitBufferForRedo(record, 3); + page = (Page) BufferGetPage(leafbuf); - _bt_pageinit(page, BufferGetPageSize(buffer)); + _bt_pageinit(page, BufferGetPageSize(leafbuf)); pageop = (BTPageOpaque) PageGetSpecialPointer(page); pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF; @@ -745,8 +744,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record) elog(ERROR, "could not add dummy high key to half-dead page"); PageSetLSN(page, lsn); - MarkBufferDirty(buffer); - UnlockReleaseBuffer(buffer); + MarkBufferDirty(leafbuf); + UnlockReleaseBuffer(leafbuf); } /* Update metapage if needed */