At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi <horikyota....@gmail.com> wrote in
> > I pushed version 26, with a few further adjustments.
> >
> > I think what we have now is sufficient, but if you want to attempt this
> > "invalidated" flag on top of what I pushed, be my guest.
>
> I don't think the invalidation flag is essential but it can prevent
> unanticipated behavior, in other words, it makes us feel at ease:p
>
> After the current master/HEAD, the following steps causes assertion
> failure in xlogreader.c.
..
> I will look at it.
Just avoiding starting replication when restart_lsn is invalid is
sufficient (see the attached, which is equivalent to a part of what the
invalidated flag did).  I think the error message needs a hint, but on
the subscriber side it then looks like this:

[22086] 2020-04-08 10:35:04.188 JST ERROR:  could not receive data from WAL stream: ERROR:  replication slot "s1" is invalidated
	HINT:  The slot exceeds the limit by max_slot_wal_keep_size.

I don't think that is clean.  Perhaps the subscriber should remove the
trailing line of the message from the publisher?

> On the other hand, physical replication doesn't break by invalidation.
>
> Primary: postgres.conf
>   max_slot_wal_keep_size=0
> Standby: postgres.conf
>   primary_conninfo='connect to master'
>   primary_slot_name='x1'
>
> (start the primary)
> P=> select pg_create_physical_replication_slot('x1');
> (start the standby)
> S=> create table tt(); drop table tt; select pg_switch_wal(); checkpoint;

If we don't mind that the standby can reconnect after a walsender
termination due to the invalidation, we don't need to do anything for
this.  Restricting max_slot_wal_keep_size to be larger than a certain
threshold would reduce the chance we see that behavior.

I saw another issue: the following sequence on the primary freezes when
invalidation happens.

=# create table tt(); drop table tt; select pg_switch_wal();
   create table tt(); drop table tt; select pg_switch_wal();
   create table tt(); drop table tt; select pg_switch_wal();
   checkpoint;

The last checkpoint command waits on the condition variable
CheckpointerShmem->start_cv in RequestCheckpoint(), while the
checkpointer is waiting on its latch at the end of CheckpointerMain().
new_started never advances, so it stays equal to old_started.  (The
wait loop in question is sketched after the attached patch.)

That freeze didn't happen when I removed
ConditionVariableSleep(&s->active_cv) in
InvalidateObsoleteReplicationSlots().  I'll continue investigating it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From b3f7e2d94b8ea9b5f3819fcf47c0e1ba57355b87 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga....@gmail.com>
Date: Wed, 8 Apr 2020 14:03:01 +0900
Subject: [PATCH] walsender crash fix

---
 src/backend/replication/walsender.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 06e8b79036..707de65f4b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1170,6 +1170,13 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	pq_flush();
 
 	/* Start reading WAL from the oldest required WAL. */
+	if (MyReplicationSlot->data.restart_lsn == InvalidXLogRecPtr)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("replication slot \"%s\" is invalidated",
+						cmd->slotname),
+				 errhint("The slot exceeds the limit by max_slot_wal_keep_size.")));
+
 	XLogBeginRead(logical_decoding_ctx->reader,
 				  MyReplicationSlot->data.restart_lsn);
-- 
2.18.2
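
For reference, the wait that the CHECKPOINT backend is stuck in looks
roughly like the following.  This is a paraphrased and simplified sketch
of the CHECKPOINT_WAIT branch of RequestCheckpoint() in
src/backend/postmaster/checkpointer.c (the failure-count handling and
some comments are omitted; the identifiers are the ones from the source):

	/*
	 * old_started was read from CheckpointerShmem->ckpt_started before
	 * signaling the checkpointer.
	 */
	ConditionVariablePrepareToSleep(&CheckpointerShmem->start_cv);
	for (;;)
	{
		int			new_started;

		SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
		new_started = CheckpointerShmem->ckpt_started;
		SpinLockRelease(&CheckpointerShmem->ckpt_lck);

		/*
		 * If the checkpointer never starts the requested checkpoint,
		 * new_started stays equal to old_started and this backend sleeps
		 * here forever, which is the freeze described above.
		 */
		if (new_started != old_started)
			break;

		ConditionVariableSleep(&CheckpointerShmem->start_cv,
							   WAIT_EVENT_CHECKPOINT_START);
	}
	ConditionVariableCancelSleep();

The only thing that releases this loop is the checkpointer advancing
ckpt_started and broadcasting start_cv in CheckpointerMain(), so the
backend stays stuck for as long as the checkpointer keeps waiting on its
latch without picking up the request.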