At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi <horikyota....@gmail.com> wrote in
> > I pushed version 26, with a few further adjustments.
> >
> > I think what we have now is sufficient, but if you want to attempt this
> > "invalidated" flag on top of what I pushed, be my guest.
>
> I don't think the invalidation flag is essential but it can prevent
> unanticipated behavior, in other words, it makes us feel at ease:p
>
> After the current master/HEAD, the following steps causes assertion
> failure in xlogreader.c.
..
> I will look at it.
Just avoiding starting replication when restart_lsn is invalid is
sufficient (see the attached, which is equivalent to a part of what the
invalidated flag did).  I think the error message needs a hint, but on
the subscriber side it then looks like this:

[22086] 2020-04-08 10:35:04.188 JST ERROR:  could not receive data from WAL stream: ERROR:  replication slot "s1" is invalidated
	HINT:  The slot exceeds the limit by max_slot_wal_keep_size.

I don't think that is clean.  Perhaps the subscriber should remove the
trailing line of the message from the publisher?

> On the other hand, physical replication doesn't break by invalidation.
>
> Primary: postgres.conf
>   max_slot_wal_keep_size=0
> Standby: postgres.conf
>   primary_conninfo='connect to master'
>   primary_slot_name='x1'
>
> (start the primary)
> P=> select pg_create_physical_replication_slot('x1');
> (start the standby)
> S=> create table tt(); drop table tt; select pg_switch_wal(); checkpoint;

If we don't mind that the standby can reconnect after a walsender
termination due to the invalidation, we don't need to do anything for
this.  Restricting max_slot_wal_keep_size to be larger than a certain
threshold would reduce the chance we see that behavior.

I saw another issue: the following sequence on the primary freezes when
invalidation happens.

=# create table tt(); drop table tt; select pg_switch_wal();
   create table tt(); drop table tt; select pg_switch_wal();
   create table tt(); drop table tt; select pg_switch_wal();
   checkpoint;

The last checkpoint command waits on the condition variable
CheckpointerShmem->start_cv in RequestCheckpoint(), while the
checkpointer is waiting on its latch at the end of CheckpointerMain().
new_started never advances, so it stays equal to old_started.  (The
wait loop in question is sketched after the attached patch.)

That freeze didn't happen when I removed
ConditionVariableSleep(&s->active_cv) in
InvalidateObsoleteReplicationSlots().  I'll continue investigating it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From b3f7e2d94b8ea9b5f3819fcf47c0e1ba57355b87 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga....@gmail.com>
Date: Wed, 8 Apr 2020 14:03:01 +0900
Subject: [PATCH] walsender crash fix

---
 src/backend/replication/walsender.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 06e8b79036..707de65f4b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1170,6 +1170,13 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	pq_flush();
 
 	/* Start reading WAL from the oldest required WAL. */
+	if (MyReplicationSlot->data.restart_lsn == InvalidXLogRecPtr)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("replication slot \"%s\" is invalidated",
+						cmd->slotname),
+				 errhint("The slot exceeds the limit by max_slot_wal_keep_size.")));
+
 	XLogBeginRead(logical_decoding_ctx->reader,
 				  MyReplicationSlot->data.restart_lsn);
-- 
2.18.2
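
For reference, the wait that the CHECKPOINT backend is stuck in looks
roughly like the following.  This is a paraphrased and simplified sketch
of the CHECKPOINT_WAIT branch of RequestCheckpoint() in
src/backend/postmaster/checkpointer.c (the failure-count handling and
some comments are omitted; the identifiers are the ones from the source):

	/*
	 * old_started was read from CheckpointerShmem->ckpt_started before
	 * signaling the checkpointer.
	 */
	ConditionVariablePrepareToSleep(&CheckpointerShmem->start_cv);
	for (;;)
	{
		int			new_started;

		SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
		new_started = CheckpointerShmem->ckpt_started;
		SpinLockRelease(&CheckpointerShmem->ckpt_lck);

		/*
		 * If the checkpointer never starts the requested checkpoint,
		 * new_started stays equal to old_started and this backend sleeps
		 * here forever, which is the freeze described above.
		 */
		if (new_started != old_started)
			break;

		ConditionVariableSleep(&CheckpointerShmem->start_cv,
							   WAIT_EVENT_CHECKPOINT_START);
	}
	ConditionVariableCancelSleep();

The only thing that releases this loop is the checkpointer advancing
ckpt_started and broadcasting start_cv in CheckpointerMain(), so the
backend stays stuck for as long as the checkpointer keeps waiting on its
latch without picking up the request.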