Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

Andres Freund Mon, 04 Sep 2017 00:06:06 -0700

Hi,

On 2017-09-04 15:51:51 +0900, Kyotaro HORIGUCHI wrote:
>               SpinLockAcquire(&walsnd->mutex);
> +             oldFlushPtr = walsnd->flush;
>               walsnd->write = writePtr;
>               walsnd->flush = flushPtr;
>               walsnd->apply = applyPtr;
> @@ -1821,7 +1836,93 @@ ProcessStandbyReplyMessage(void)
>               if (SlotIsLogical(MyReplicationSlot))
>                       LogicalConfirmReceivedLocation(flushPtr);
>               else
> -                     PhysicalConfirmReceivedLocation(flushPtr);
> +             {
> +                     /*
> +                      * Recovery on standby requires that a continuation 
> record is
> +                      * available from single WAL source. For the reason, 
> physical
> +                      * replication slot should stay in the first segment of 
> the
> +                      * multiple segments that a continued record is spanning
> +                      * over. Since we look pages and don't look into 
> individual record
> +                      * here, restartLSN may stay a bit too behind but it 
> doesn't
> +                      * matter.
> +                      *
> +                      * Since the objective is avoiding to remove required 
> segments,
> +                      * checking at the beginning of every segment is 
> enough. But once
> +                      * restartLSN goes behind, check every page for quick 
> restoration.
> +                      *
> +                      * restartLSN has a valid value only when it is behind 
> flushPtr.
> +                      */
> +                     bool    check_contrecords = false;
> +
> +                     if (restartLSN != InvalidXLogRecPtr)
> +                     {
> +                             /*
> +                              * We're retarding restartLSN, check 
> continuation records
> +                              * at every page boundary for quick restoration.
> +                              */
> +                             if (restartLSN / XLOG_BLCKSZ != flushPtr / 
> XLOG_BLCKSZ)
> +                                     check_contrecords = true;
> +                     }
> +                     else if (oldFlushPtr != InvalidXLogRecPtr)
> +                     {
> +                             /*
> +                              * We're not retarding restartLSN , check 
> continuation records
> +                              * only at segment boundaries. No need to check 
> if
> +                              */
> +                             if (oldFlushPtr / XLOG_SEG_SIZE != flushPtr / 
> XLOG_SEG_SIZE)
> +                                     check_contrecords = true;
> +                     }
> +
> +                     if (check_contrecords)
> +                     {
> +                             XLogRecPtr rp;
> +
> +                             if (restartLSN == InvalidXLogRecPtr)
> +                                     restartLSN = oldFlushPtr;
> +
> +                             rp = restartLSN - (restartLSN % XLOG_BLCKSZ);
> +
> +                             /*
> +                              * We may have let the record at flushPtr be 
> sent, so it's
> +                              * worth looking
> +                              */
> +                             while (rp <= flushPtr)
> +                             {
> +                                     XLogPageHeaderData header;
> +
> +                                     /*
> +                                      * If the page header is not available 
> for now, don't move
> +                                      * restartLSN forward. We can read it 
> by the next chance.
> +                                      */
> +                                     if(sentPtr - rp >= 
> sizeof(XLogPageHeaderData))
> +                                     {
> +                                             bool found;
> +                                             /*
> +                                              * Fetch the page header of the 
> next page. Move
> +                                              * restartLSN forward only if 
> it is not a continuation
> +                                              * page.
> +                                              */
> +                                             found = XLogRead((char 
> *)&header, rp,
> +                                                                             
>          sizeof(XLogPageHeaderData), true);
> +                                             if (found &&
> +                                                     (header.xlp_info & 
> XLP_FIRST_IS_CONTRECORD) == 0)
> +                                                     restartLSN = rp;
> +                                     }
> +                                     rp += XLOG_BLCKSZ;
> +                             }
> +
> +                             /*
> +                              * If restartLSN is on the same page with 
> flushPtr, it means
> +                              * that we are catching up.
> +                              */
> +                             if (restartLSN / XLOG_BLCKSZ == flushPtr / 
> XLOG_BLCKSZ)
> +                                     restartLSN = InvalidXLogRecPtr;
> +                     }
> +
> +                     /* restartLSN == InvalidXLogRecPtr means catching up */
> +                     PhysicalConfirmReceivedLocation(restartLSN != 
> InvalidXLogRecPtr ?
> +                                                                             
>         restartLSN : flushPtr);
> +             }


I've not read through the thread, but this seems like the wrong approach
to me. The receiving side should use a correct value, instead of putting
this complexity on the sender's side.

Don't we also use the value on the receiving side to delete old segments
and such?

- Andres


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

Reply via email to