On Tue, Aug 13, 2019 at 5:17 AM Andrey Borodin <x4...@yandex-team.ru> wrote: > We have a bunch of internal testing HA clusters that suffered from corruption > conditions. > We fixed everything that can be detected with parent-check on primaries or > usual check on standbys. > (page updates were lost both on primary and during WAL replay) > But from time to time when clusters switch primary from one availability zone > to another we observe > "right sibling's left-link doesn't match: block 32709 links to 37022 instead > of expected 40953 in index"
That sounds like an issue caused by a failure to replay all available WAL, where only one page happened to get written out by a checkpoint before a crash. It's something like that. That wouldn't be caught by the cross-page bt_index_check() check that we do already. > We are going to search for these clusters with this [0] tolerating possible > fraction of false positives, we have them anyway. > But I think I could put some effort into making corruption-detection tooling > better. > I think if we observe links discrepancy, we can acquire lock of left and > right pages and recheck. That's one possibility. When I first designed amcheck it was important to be conservative, so I invented a general rule about never acquiring multiple buffer locks at once. I still think that that was the correct decision for the bt_downlink_check() check (the main extra bt_index_parent_check() check), but I think that you're right about retrying to verify the sibling links when bt_index_check() is called from SQL. nbtree will often "couple" buffer locks on the leaf level; it will acquire a lock on a leaf page, and not release that lock until it has also acquired a lock on the right sibling page (I'm mostly thinking of _bt_stepright()). I am in favor of a patch that makes amcheck perform sibling link verification within bt_index_check(), by retrying while pessimistically coupling buffer locks. (Though I think that that should just happen on the leaf level. We should not try to be too clever about ignorable/half-dead/deleted pages, to be conservative.) -- Peter Geoghegan