Stephen Frost <sfr...@snowman.net> writes:
> On Wed, Apr 21, 2021 at 19:17 Thomas Munro <thomas.mu...@gmail.com> wrote:
>> ... Personally I think the right thing to do now is to revert it
>> and re-propose for 15 early in the cycle, supported with some better
>> testing infrastructure.
> I tend to agree with the idea to revert it, perhaps a +0 on that, but if
> others argue it should be fixed in-place, I wouldn’t complain about it.

FWIW, I've so far only been able to see problems on two old PPC Macs, one of which has been known to be a bit flaky in the past. So it's possible that what I'm looking at is a hardware glitch. But it's consistent enough that I rather doubt that.

What I'm doing is running the core regression tests with a single standby (on the same machine) and wal_consistency_checking = all. Fairly reproducibly (more than one run in ten), what I get on the slightly-flaky machine is consistency check failures like

2021-04-21 17:42:56.324 EDT [42286] PANIC: inconsistent page found, rel 1663/354383/357033, forknum 0, blkno 9, byte offset 2069: replay 0x00 primary 0x03
2021-04-21 17:42:56.324 EDT [42286] CONTEXT: WAL redo at 24/121C97B0 for Heap/INSERT: off 107 flags 0x00; blkref #0: rel 1663/354383/357033, blk 9 FPW
2021-04-21 17:45:11.662 EDT [42284] LOG: startup process (PID 42286) was terminated by signal 6: Abort trap

2021-04-21 11:25:30.091 EDT [38891] PANIC: inconsistent page found, rel 1663/229880/237980, forknum 0, blkno 108, byte offset 3845: replay 0x00 primary 0x99
2021-04-21 11:25:30.091 EDT [38891] CONTEXT: WAL redo at 17/A99897FC for SPGist/ADD_LEAF: add leaf to page; off 241; headoff 171; parentoff 0; blkref #0: rel 1663/229880/237980, blk 108 FPW
2021-04-21 11:26:59.371 EDT [38889] LOG: startup process (PID 38891) was terminated by signal 6: Abort trap

2021-04-20 19:20:16.114 EDT [34405] PANIC: inconsistent page found, rel 1663/189216/197311, forknum 0, blkno 115, byte offset 6149: replay 0x37 primary 0x03
2021-04-20 19:20:16.114 EDT [34405] CONTEXT: WAL redo at 13/3CBFED00 for SPGist/ADD_LEAF: add leaf to page; off 241; headoff 171; parentoff 0; blkref #0: rel 1663/189216/197311, blk 115 FPW
2021-04-20 19:21:54.421 EDT [34403] LOG: startup process (PID 34405) was terminated by signal 6: Abort trap

2021-04-20 17:44:09.356 EDT [24106] FATAL: inconsistent page found, rel 1663/135419/143843, forknum 0, blkno 101, byte offset 6152: replay 0x40 primary 0x00
2021-04-20 17:44:09.356 EDT [24106] CONTEXT: WAL redo at D/5107D8A8 for Gist/PAGE_UPDATE: ; blkref #0: rel 1663/135419/143843, blk 101 FPW

(Note I modified checkXLogConsistency to PANIC on failure, so I could get a core dump to analyze; and it's also printing the first-mismatch location.)

I have not analyzed each one of these failures exhaustively, but on the ones I have looked at closely, the replay_image_masked version of the page appears correct while the primary_image_masked version is *not*. Moreover, the primary_image_masked version does not match the full-page image that I see in the on-disk WAL file. It did however seem to match the in-memory WAL record contents that the decoder is operating on. So unless you want to believe the buggy-hardware theory, something's occasionally messing up while loading WAL records from disk.

All of the trouble cases involve records that span across WAL pages (unsurprising since they contain FPIs), so maybe there's something not quite right in there. In the cases that I looked at closely, it appeared that there was a block of 32 wrong bytes somewhere within the page image, with the data before and after that being correct. I'm not sure if that pattern holds in all cases though.

BTW, if I restart the failed standby, it plows through the same data just fine, confirming that the on-disk WAL is not corrupt.
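For anyone wanting to reproduce the setup: it's nothing exotic, just a primary running the core regression tests with roughly the settings below and a streaming standby on the same machine. Only wal_consistency_checking = all is essential to the test; the other values here are illustrative, not necessarily exactly what I used.

    # primary postgresql.conf (sketch of the repro configuration)
    wal_level = replica                 # required for streaming replication
    max_wal_senders = 10                # one walsender is all the standby needs
    wal_consistency_checking = all      # add check images for every rmgr so the
                                        # standby cross-checks each page after redo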
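The checkXLogConsistency() change I mentioned is roughly along these lines. This is a sketch from memory against the comparison logic in xlog.c, not the exact patch; replay_image_masked, primary_image_masked, rnode, forknum and blkno are the variables that function already has in scope at that point.

    if (memcmp(replay_image_masked, primary_image_masked, BLCKSZ) != 0)
    {
        int     firstdiff;

        /* locate the first mismatching byte so it can be reported */
        for (firstdiff = 0; firstdiff < BLCKSZ; firstdiff++)
        {
            if (replay_image_masked[firstdiff] != primary_image_masked[firstdiff])
                break;
        }

        /* PANIC rather than FATAL, so a core dump is left behind to analyze */
        elog(PANIC,
             "inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u, "
             "byte offset %d: replay 0x%02x primary 0x%02x",
             rnode.spcNode, rnode.dbNode, rnode.relNode,
             forknum, blkno, firstdiff,
             (unsigned char) replay_image_masked[firstdiff],
             (unsigned char) primary_image_masked[firstdiff]);
    }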
The other PPC machine (with no known history of trouble) is the one that had the CRC failure I showed earlier. That one does seem to be actual bad data in the stored WAL, because the problem was also seen by pg_waldump, and trying to restart the standby got the same failure again. I've not been able to duplicate the consistency-check failures there. But because that machine is a laptop with a much inferior disk drive, the speeds are different enough that it's not really surprising if it doesn't hit the same problem.

I've also tried to reproduce on 32-bit and 64-bit Intel, without success. So if this is real, maybe it's related to being big-endian hardware? But it's also quite sensitive to $dunno-what, maybe the history of WAL records that have already been replayed.

			regards, tom lane