Hi,

As detailed in
https://postgr.es/m/20190319200050.ncuxejradurjakdc%40alap3.anarazel.de
the way the backend's basebackup checksum verification works makes its
error detection capabilities very dubious.
I think we need to fix this before the next set of back-branch releases,
or at the very least add a big fat warning that the feature isn't doing
much.

The more I think about it, the less convinced I am of the method of
avoiding the torn-page problem using LSNs. To make e.g. the PageIsNew()
check correct, we need to verify that the whole page is zeroes - but we
can't use the LSN for that, as it's not on the page. And there very well
could be a torn-page issue with only the second half of the page being
written back (with the default 8kB pages that can trivially happen, as
the kernel's pagecache commonly uses 4kB granularity).

I basically think it needs to work like this:

1) Perform the entire set of PageIsVerified() checks *without*
   previously checking the page's LSN, but don't error out.

2) If 1) detected a failure, check whether that's because the backend is
   currently writing out the page. That basically requires doing what
   BufferAlloc() does. So I think we'd need to end up with a function
   like:

   LockoutBackendWrites():
       buf = ReadBufferWithoutRelcache(relfilenode);
       LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
       /*
        * Reread the page from the OS, and recheck. This needs to happen
        * while the content lock prevents the backend from writing the
        * page out concurrently. Note that we do not want to rely on the
        * buffer contents here, as those could be very old cache
        * contents.
        */
       perform_checksum_check(relfilenode, ERROR);
       LWLockRelease(BufferDescriptorGetContentLock(buf));
       ReleaseBuffer(buf);

3) If 2) also failed, then we can be sure that the page is truly
   corrupted.

Greetings,

Andres Freund