On Tue, Aug 23, 2022 at 12:06 AM Robert Haas <[email protected]> wrote:
> Nothing that uses xlogreader is going to be able to bridge the gap
> between file #4 and file #5. In this case it doesn't matter very much,
> because we immediately write a checkpoint record into file #5, so if
> we crash we won't try to replay file #4 anyway. However, if anything
> did try to look at file #4 it would get confused. Maybe that can
> happen if this is a streaming standby, where we only write an
> end-of-recovery record upon promotion, rather than a checkpoint, or
> maybe if there are cascading standbys someone could try to actually
> use the 000000020000000000000004 file for something. I'm not sure. But
> unless I'm missing something, that file is bogus, and our only hope of
> not having problems is that perhaps no one will ever look at it.
Yeah, this analysis looks correct to me.
> I think that the cause of this problem is this code right here:
>
> /*
> * Actually, if WAL ended in an incomplete record, skip the parts that
> * made it through and start writing after the portion that persisted.
> * (It's critical to first write an OVERWRITE_CONTRECORD message, which
> * we'll do as soon as we're open for writing new WAL.)
> */
> if (!XLogRecPtrIsInvalid(missingContrecPtr))
> {
> Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
> EndOfLog = missingContrecPtr;
> }
Yeah, this statement as well as another statement that creates the
overwrite contrecord. After changing these two lines the problem is
fixed for me. Although I haven't yet thought of all the scenarios
that whether it is safe in all the cases. I agree that after timeline
changes we are pointing to the end of the last valid record we can
start writing the next record from that point onward. But I think we
should need to think hard that whether it will break any case for
which the overwrite contrecord was actually introduced.
diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index 7602fc8..3d38613 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5491,7 +5491,7 @@ StartupXLOG(void)
* (It's critical to first write an OVERWRITE_CONTRECORD message, which
* we'll do as soon as we're open for writing new WAL.)
*/
- if (!XLogRecPtrIsInvalid(missingContrecPtr))
+ if (newTLI == endOfRecoveryInfo->lastRecTLI &&
!XLogRecPtrIsInvalid(missingContrecPtr))
{
Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
EndOfLog = missingContrecPtr;
@@ -5589,7 +5589,7 @@ StartupXLOG(void)
LocalSetXLogInsertAllowed();
/* If necessary, write overwrite-contrecord before doing
anything else */
- if (!XLogRecPtrIsInvalid(abortedRecPtr))
+ if (newTLI == endOfRecoveryInfo->lastRecTLI &&
!XLogRecPtrIsInvalid(abortedRecPtr))
{
Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
CreateOverwriteContrecordRecord(abortedRecPtr,
missingContrecPtr, newTLI);
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com