Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

Heikki Linnakangas Mon, 23 Apr 2018 00:43:18 -0700

On 18/01/18 20:54, Kyotaro HORIGUCHI wrote:

At Thu, 18 Jan 2018 11:52:52 -0800, Andres Freund <[email protected]> wrote in 
<[email protected]>

On 2018-01-18 20:58:27 +0900, Michael Paquier wrote:

b) The second patch that I would like to mention is doing things on the
standby side by being able to change a WAL source when fetching a single
record so as it is possible to fetch the beginning of a cross-segment
record from say an archive or the local xlog, and then request the rest
on the primary. This patch can be found in
https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
and the diff in WaitForWALToBecomeAvailable() scares me a lot because,
it seems to me that this actually breaks timeline switch
consistency. The concept of switching a WAL source in the middle of a
WAL segment is risky anyway, because we perform sanity checks while
switching sources. The bug we are dealing about here is very rare, and
seeing a problem like that for a timeline switch is even more rare, but
the risk is not zero and we may finish with a corrupted instance, so we
should not in my opinion take this approach. Also this actually achieves
point 2) above, not completely though, but not 1).


I don't agree with the sentiment about the approach, but I agree there
might be weaknesses in the implementation (which I have not reviewed). I
think it's perfectly sensible to require fetching one segment from
pg_xlog and the next one via another method. Currently the inability to
do so often leads to us constantly refetching the same segment.


Though I'm still not fully confident, if reading a set of
continuation records from two (or more) sources can lead to
inconsistency, two successve (complete) records are facing the
same risk.


This seems like the best approach to me as well.

Looking at the patch linked above(https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp):

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11693,6 +11693,10 @@ retry:
        Assert(reqLen <= readLen);

*readTLI = curFileTLI;

+
+       if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
+               goto next_record_is_invalid;
+
        return readLen;

next_record_is_invalid:

What is the point of adding this XLogReaderValidatePageHeader() call?The caller of this callback function, ReadPageInternal(), checks thepage header anyway. Earlier in this thread, you said:

The rest to do is let XLogPageRead retry other sources
immediately. To do this I made ValidXLogPageHeader(at)xlogreader(dot)c
public (and renamed to XLogReaderValidatePageHeader).


but I don't understand that.

- Heikki

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

Reply via email to