On Tue, Aug 23, 2022 at 12:06 AM Robert Haas <robertmh...@gmail.com> wrote:
> Nothing that uses xlogreader is going to be able to bridge the gap
> between file #4 and file #5. In this case it doesn't matter very much,
> because we immediately write a checkpoint record into file #5, so if
> we crash we won't try to replay file #4 anyway. However, if anything
> did try to look at file #4 it would get confused. Maybe that can
> happen if this is a streaming standby, where we only write an
> end-of-recovery record upon promotion, rather than a checkpoint, or
> maybe if there are cascading standbys someone could try to actually
> use the 000000020000000000000004 file for something. I'm not sure. But
> unless I'm missing something, that file is bogus, and our only hope of
> not having problems is that perhaps no one will ever look at it.

Yeah, this analysis looks correct to me.

> I think that the cause of this problem is this code right here:
>
>     /*
>      * Actually, if WAL ended in an incomplete record, skip the parts that
>      * made it through and start writing after the portion that persisted.
>      * (It's critical to first write an OVERWRITE_CONTRECORD message, which
>      * we'll do as soon as we're open for writing new WAL.)
>      */
>     if (!XLogRecPtrIsInvalid(missingContrecPtr))
>     {
>         Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
>         EndOfLog = missingContrecPtr;
>     }

Yeah, this statement, as well as the other statement that creates the
overwrite contrecord. After changing these two lines, the problem is
fixed for me, although I haven't yet thought through all the scenarios
to check whether the change is safe in every case. I agree that after a
timeline change, since we are pointing to the end of the last valid
record, we can start writing the next record from that point onward.
But I think we need to think hard about whether this will break any
case for which the overwrite contrecord was actually introduced.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7602fc8..3d38613 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5491,7 +5491,7 @@ StartupXLOG(void)
 	 * (It's critical to first write an OVERWRITE_CONTRECORD message, which
 	 * we'll do as soon as we're open for writing new WAL.)
 	 */
-	if (!XLogRecPtrIsInvalid(missingContrecPtr))
+	if (newTLI == endOfRecoveryInfo->lastRecTLI && !XLogRecPtrIsInvalid(missingContrecPtr))
 	{
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
@@ -5589,7 +5589,7 @@ StartupXLOG(void)
 	LocalSetXLogInsertAllowed();
 
 	/* If necessary, write overwrite-contrecord before doing anything else */
-	if (!XLogRecPtrIsInvalid(abortedRecPtr))
+	if (newTLI == endOfRecoveryInfo->lastRecTLI && !XLogRecPtrIsInvalid(abortedRecPtr))
 	{
 		Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
 		CreateOverwriteContrecordRecord(abortedRecPtr, missingContrecPtr, newTLI);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com