On Mon, Aug 8, 2022 at 9:09 AM Thomas Munro <thomas.mu...@gmail.com> wrote: > Thanks for the repro patch and bisection work. Looking...
I don't have the complete explanation yet, but it's something like this. We hit the following branch in xlogrecovery.c... if (StandbyMode && !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf)) { /* * Emit this error right now then retry this page immediately. Use * errmsg_internal() because the message was already translated. */ if (xlogreader->errormsg_buf[0]) ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), (errmsg_internal("%s", xlogreader->errormsg_buf))); /* reset any error XLogReaderValidatePageHeader() might have set */ xlogreader->errormsg_buf[0] = '\0'; goto next_record_is_invalid; } ... but, even though there was a (suppressed) error, nothing invalidates the reader's page buffer. Normally, XLogReadValidatePageHeader() failure or any other kind of error encountered by xlogreader.c'd decoding logic would do that, but here the read_page callback is directly calling the header validation. Without prefetching, that doesn't seem to matter, but reading ahead can cause us to have the problem page in our buffer at the wrong time, and then not re-read it when we should. Or something like that. The attached patch that simply moves the cache invalidation into report_invalid_record(), so that it's reached by the above code and everything else that reports an error, seems to fix the problem in src/bin/pg_ctl/t/003_promote.pl with Noah's spanner-in-the-works patch applied, and passes check-world without it. I need to look at this some more, though, and figure out if it's the right fix.
From 1f4fc659441a805a50b77f4e7810f9a25b5e62ae Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Mon, 8 Aug 2022 14:09:32 +1200 Subject: [PATCH] WIP Fix XLogPageRead() cache invalidation. When ReadPageInternal() reports an error, including due to failure of XLogReaderValidatePageHeader(), the reader calls XLogReaderInvalReaderState() to invalidate the contents of its buffer, forcing a re-read next time. This did not work if a header validation error was detected by the read_page callback itself. XLogPageRead() calls XLogValidatePageHeader() directly, and if that failed, there was no cache invalidation. Fix, by invalidating the cache directly in report_invalid_record(), so that XLogValidatePageHeader()'s failure invalidates the cache even with called directly by xlogrecovery.c. XXX More research/testing needed diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 06e91547dd..e6431722e7 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -81,6 +81,12 @@ report_invalid_record(XLogReaderState *state, const char *fmt,...) va_end(args); state->errormsg_deferred = true; + + /* + * Invalidate the read state. We might read from a different source after + * failure. + */ + XLogReaderInvalReadState(state); } /* @@ -1066,11 +1072,6 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen) return readLen; err: - if (state->errormsg_buf[0] != '\0') - { - state->errormsg_deferred = true; - XLogReaderInvalReadState(state); - } return XLREAD_FAIL; } -- 2.37.1