On Wed, Oct 4, 2023 at 7:52 PM Noah Misch <n...@leadboat.com> wrote: > Suppose we start with this nbtree (subset of a diagram from verify_nbtree.c): > > * 1 > * / \ > * 2 <-> 3 > > We're deleting 2, the leftmost leaf under a leftmost internal page. After the > MARK_PAGE_HALFDEAD record, the first downlink from 1 will lead to 3, which > still has a btpo_prev pointing to 2. bt_index_parent_check() complains here:
Thanks for working on this. Your analysis seems sound to me. When I reviewed the patch that became commit d114cc53, I presented Alexander with a test case that demonstrated false positive reports of corruption involving interrupted page splits (not interrupted page deletions). Obviously I didn't give sufficient thought to this case, which is analogous. Might make sense to test the fix for this issue using a similar approach: by adding custom code that randomly throws errors at a point that stresses the implementation. I'm referring to the point at which VACUUM is between the first and second phase of page deletion: right before (or directly after) _bt_unlink_halfdead_page() is called. That isn't fundamentally different to the approach from your TAP test, but seems like it might add some interesting variation. > One can encounter this if recovery ends between a MARK_PAGE_HALFDEAD record > and its corresponding UNLINK_PAGE record. See the attached test case. The > index is actually fine in such a state, right? Yes, it is fine. FWIW, this feels like it might be related to the fact that (unlike Lanin & Shasha), we don't make the key space move left; we make it move right instead (just like page splits). In other words, page deletion isn't the precise opposite of a page split, which is a bit awkward. Note, in particular, that _bt_mark_page_halfdead() doesn't do a straight delete of the pivot tuple in the parent page that points to the target page, as you might expect. It actually deletes the right sibling of the target page's pivot, and then performs an in-place overwrite of the downlink from the pivot tuple that originally pointed to the target page. Perhaps this isn't worth going into now, but thought you might appreciate the context. Terminology note: we sometimes use "downlink" as a synonym of "pivot tuple" or even "separator key", which is misleading. > I lean toward fixing this by > having amcheck scan left; if left links reach only half-dead or deleted pages, > that's as good as the present child block being P_LEFTMOST. Also my preference. > There's a > different error from bt_index_check(), and I've not yet studied how to fix > that: > > ERROR: left link/right link pair in index "not_leftmost_pk" not in > agreement > DETAIL: Block=0 left block=0 left link from block=4294967295. This looks like this might be a straightforward case of confusing P_NONE for InvalidBlockNumber (or vice-versa). They're both used to indicate "no such block" here. > Alternatively, one could view this as a need for the user to VACUUM between > recovery and amcheck. The documentation could direct users to "VACUUM > (DISABLE_PAGE_SKIPPING off, INDEX_CLEANUP on, TRUNCATE off)" if not done since > last recovery. Does anyone prefer that or some other alternative? I'd rather not go that route. That strikes me as defining the problem out of existence. > For some other amcheck expectations, the comments suggest reliance on the > bt_index_parent_check() ShareLock. I haven't tried to make test cases for > them, but perhaps recovery can trick them the same way. Examples: > > errmsg("downlink or sibling link points to deleted block in index \"%s\"", > errmsg("block %u is not leftmost in index \"%s\"", > errmsg("block %u is not true root in index \"%s\"", These are all much older. They're certainly all from before the relevant checks were first added (by commit d114cc53), and seem much less likely to be buggy. These older cases are all cases where we descend directly from an internal page to one of its child pages. Whereas the problem you've demonstrated involves traversal across levels *and* across siblings in newer code. That's quite a bit more complicated, since it requires that we worry about both phases of page deletion -- not just the first. That in itself necessitates that we deal with various edge cases. (The really prominent edge-case is the interrupted page deletion case, which requires significant handling, but evidently missed a subtlety with leftmost pages). -- Peter Geoghegan