On Wed, 6 Aug 2025 at 20:00, Andrey Borodin <x4...@yandex-team.ru> wrote:
>
> Hi hackers!
>
> I was reviewing the patch about removing xl_heap_visible and found the VM\WAL
> machinery very interesting.
> At Yandex we had several incidents with corrupted VM, and at pgconf.dev
> colleagues from AWS confirmed that they saw something similar too.
> So I toyed around and accidentally wrote a test that reproduces $subj.
>
> I think the corruption happens as follows:
> 0. we create a table with one frozen tuple
> 1. next heap_insert() clears VM bit and hangs immediately, nothing was logged
>    yet
> 2. VM buffer is flushed to disk by checkpointer or bgwriter
> 3. primary is killed with -9
>    now we have a page that is ALL_VISIBLE\ALL_FROZEN on standby, but clear VM
>    bits on primary
> 4. subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in its WAL
>    record
> 5. pg_visibility detects corruption
>
> Interestingly, in an off-list conversation Melanie explained to me how
> ALL_VISIBLE is protected from this: WAL-logging depends on the PD_ALL_VISIBLE
> heap page bit, not on the state of the VM. But for ALL_FROZEN this is not the
> case:
>
>     /* Clear only the all-frozen bit on visibility map if needed */
>     if (PageIsAllVisible(page) &&
>         visibilitymap_clear(relation, block, vmbuffer,
>                             VISIBILITYMAP_ALL_FROZEN))
>         cleared_all_frozen = true; // this won't happen due to flushed VM
>                                    // buffer before a crash
>
> Anyway, the test reproduces corruption of both bits. It also reproduces
> selecting deleted data on standby.
>
> The test is not intended to be committed when we fix the problem, so some
> waits are simulated with sleep(1) and the test is placed at modules/test_slru,
> where it was easier to write. But if we ever want something like this, I can
> design a less hacky version. And, probably, a more generic one.
>
> Thanks!
>
> Best regards, Andrey Borodin.
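
The sequence quoted above can be sketched as a toy model in plain C. This is
not PostgreSQL code: ToyNode, toy_vm_clear and toy_scenario_diverges are
hypothetical names made up for illustration, and the model assumes a single
heap page, a single VM byte, and a heap page that is never flushed before the
crash. The only behaviour borrowed from the real code is that the clear
function reports whether any bit was actually cleared, as in the quoted
snippet. The flag_from_page_bit switch is there only to show the contrast
with the PD_ALL_VISIBLE-based logging described above; it is not a proposed
fix.

```c
#include <stdbool.h>

#define VM_ALL_VISIBLE 0x01
#define VM_ALL_FROZEN  0x02

/* Toy node: one heap page bit plus one VM byte, in memory and on disk. */
typedef struct ToyNode
{
    bool          pd_all_visible; /* heap page bit (page is never flushed here) */
    unsigned char vm_mem;         /* VM bits in the buffer cache */
    unsigned char vm_disk;        /* VM bits on disk */
} ToyNode;

/* Clears the given bits; returns true only if some bit was actually set. */
static bool
toy_vm_clear(ToyNode *n, unsigned char flags)
{
    bool cleared = (n->vm_mem & flags) != 0;

    n->vm_mem &= (unsigned char) ~flags;
    return cleared;
}

/*
 * Runs steps 0-4 and returns true if the standby's ALL_FROZEN bit ends up
 * different from the primary's.
 */
static bool
toy_scenario_diverges(bool flag_from_page_bit)
{
    /* 0. one frozen tuple: page bit and both VM bits set on both nodes */
    ToyNode primary = {true, VM_ALL_VISIBLE | VM_ALL_FROZEN,
                       VM_ALL_VISIBLE | VM_ALL_FROZEN};
    ToyNode standby = primary;
    bool    cleared_all_frozen = false;

    /* 1. backend clears the VM bit and stalls before writing any WAL */
    toy_vm_clear(&primary, VM_ALL_FROZEN);

    /* 2. checkpointer or bgwriter flushes the VM buffer */
    primary.vm_disk = primary.vm_mem;

    /* 3. kill -9: memory is lost, VM is reread from disk, standby saw nothing */
    primary.vm_mem = primary.vm_disk;

    /* 4. next operation after restart decides what goes into its WAL record */
    if (flag_from_page_bit)
    {
        cleared_all_frozen = primary.pd_all_visible;
        toy_vm_clear(&primary, VM_ALL_FROZEN);
    }
    else if (primary.pd_all_visible &&
             toy_vm_clear(&primary, VM_ALL_FROZEN))
        cleared_all_frozen = true; /* not reached: the bit is already clear */

    /* standby replay clears its VM bit only if the record says so */
    if (cleared_all_frozen)
        toy_vm_clear(&standby, VM_ALL_FROZEN);

    return (primary.vm_mem & VM_ALL_FROZEN) != (standby.vm_mem & VM_ALL_FROZEN);
}
```

In this model toy_scenario_diverges(false) reports a mismatch between the two
nodes, while toy_scenario_diverges(true) does not, because the record flag no
longer depends on what the already-flushed VM buffer happens to contain.
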
The attached patch reproduces the same corruption, but without any standby node.
CHECKPOINT somehow manages to flush the heap page when the instance is killed
with kill -9. As a result, we have an inconsistency between the heap and VM pages:

```
reshke=# select * from pg_visibility('x');
 blkno | all_visible | all_frozen | pd_all_visible
-------+-------------+------------+----------------
     0 | t           | t          | f
(1 row)
```

Note that I moved the injection point one line above visibilitymap_clear.
Without this change the behaviour still reproduces, but much less frequently.

--
Best regards,
Kirill Reshke
Attachment: v2-0001-Corrupt-VM-on-standby.patch