On 2020-Apr-08, Kyotaro Horiguchi wrote:

> At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi 
> <horikyota....@gmail.com> wrote in 
> 
> Just avoiding starting replication when restart_lsn is invalid is
> sufficient (the attached, which is equivalent to a part of what the
> invalidated flag did). I thing that the error message needs a Hint but
> it looks on the subscriber side as:
> 
> [22086] 2020-04-08 10:35:04.188 JST ERROR:  could not receive data from WAL 
> stream: ERROR:  replication slot "s1" is invalidated
>         HINT:  The slot exceeds the limit by max_slot_wal_keep_size.
> 
> I don't think it is not clean.. Perhaps the subscriber should remove
> the trailing line of the message from the publisher?

Thanks for the fix!  I propose two changes:

1. reword the error like this:

ERROR:  replication slot "regression_slot3" cannot be advanced
DETAIL:  This slot has never previously reserved WAL, or has been invalidated

2. use the same error in one other place, to wit
   pg_logical_slot_get_changes() and pg_replication_slot_advance().  I
   made the DETAIL part the same in all places, but the ERROR line is
   adjusted to what each callsite is doing.
   I do think that this change in test_decoding is a bit unpleasant:

-ERROR:  cannot use physical replication slot for logical decoding
+ERROR:  cannot get changes from replication slot "repl"

   The test is
      -- check that we're detecting a streaming rep slot used for logical 
decoding
      SELECT 'init' FROM pg_create_physical_replication_slot('repl');
      SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 
'include-xids', '0', 'skip-empty-xacts', '1');

> > On the other hand, physical replication doesn't break by invlidation.
> > [...]
> 
> If we don't mind that standby can reconnect after a walsender
> termination due to the invalidation, we don't need to do something for
> this.  Restricting max_slot_wal_keep_size to be larger than a certain
> threshold would reduce the chance we see that behavior.

Yeah, I think you're referring to the fact that StartReplication()
doesn't verify the restart_lsn of the slot; and if we do add a check, a
few tests that rely on physical replication start to fail.  This patch
only adds a comment in that spot.  But I don't (yet) know what the
consequences of this are, or whether it can be fixed by setting a valid
restart_lsn ahead of time.  This test in pg_basebackup fails, for
example:

# Running: pg_basebackup -D 
/home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl
 -X stream -S slot1
pg_basebackup: error: could not send replication command "START_REPLICATION": 
ERROR:  cannot read from replication slot "slot1"
DETAIL:  This slot has never previously reserved WAL, or has been invalidated
pg_basebackup: error: child process exited with exit code 1
pg_basebackup: removing data directory 
"/home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl"
not ok 95 - pg_basebackup -X stream with replication slot runs

#   Failed test 'pg_basebackup -X stream with replication slot runs'
#   at t/010_pg_basebackup.pl line 461.


Anyway I think the current patch can be applied as is -- and if we want
physical replication to have some other behavior, we can patch for that
afterwards.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 74dcfd5f6384d9bbdb1d23e6d4d69fcf798c7cff Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Mon, 27 Apr 2020 17:32:28 -0400
Subject: [PATCH] check slot->restart_lsn in a couple of places

---
 contrib/test_decoding/expected/ddl.out         |  3 ++-
 contrib/test_decoding/expected/slot.out        |  3 ++-
 src/backend/replication/logical/logicalfuncs.c |  7 +++++++
 src/backend/replication/slotfuncs.c            |  4 +++-
 src/backend/replication/walsender.c            | 14 +++++++++++++-
 5 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out
index d79cd316b7..bbc8e445af 100644
--- a/contrib/test_decoding/expected/ddl.out
+++ b/contrib/test_decoding/expected/ddl.out
@@ -41,7 +41,8 @@ SELECT 'init' FROM pg_create_physical_replication_slot('repl');
 (1 row)
 
 SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-ERROR:  cannot use physical replication slot for logical decoding
+ERROR:  cannot get changes from replication slot "repl"
+DETAIL:  This slot has never previously reserved WAL, or has been invalidated
 SELECT pg_drop_replication_slot('repl');
  pg_drop_replication_slot 
 --------------------------
diff --git a/contrib/test_decoding/expected/slot.out b/contrib/test_decoding/expected/slot.out
index 1000171530..b448e881d6 100644
--- a/contrib/test_decoding/expected/slot.out
+++ b/contrib/test_decoding/expected/slot.out
@@ -143,7 +143,8 @@ SELECT slot_name FROM pg_create_physical_replication_slot('regression_slot3');
 SELECT pg_replication_slot_advance('regression_slot3', '0/0'); -- invalid LSN
 ERROR:  invalid target WAL LSN
 SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error
-ERROR:  cannot advance replication slot that has not previously reserved WAL
+ERROR:  replication slot "regression_slot3" cannot be advanced
+DETAIL:  This slot has never previously reserved WAL, or has been invalidated
 SELECT pg_drop_replication_slot('regression_slot3');
  pg_drop_replication_slot 
 --------------------------
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index f5384f1df8..8412e14fe1 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -227,6 +227,13 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 
 	(void) ReplicationSlotAcquire(NameStr(*name), SAB_Error);
 
+	if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("cannot get changes from replication slot \"%s\"",
+						NameStr(*name)),
+				 errdetail("This slot has never previously reserved WAL, or has been invalidated")));
+
 	PG_TRY();
 	{
 		/* restart at slot's confirmed_flush */
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index f776de3df7..1cfe2e5893 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -598,7 +598,9 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 	if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("cannot advance replication slot that has not previously reserved WAL")));
+				 errmsg("replication slot \"%s\" cannot be advanced",
+						NameStr(*slotname)),
+				 errdetail("This slot has never previously reserved WAL, or has been invalidated")));
 
 	/*
 	 * Check if the slot is not moving backwards.  Physical slots rely simply
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 6c87f6087c..8373f00e80 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -600,6 +600,12 @@ StartReplication(StartReplicationCmd *cmd)
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 					 errmsg("cannot use a logical replication slot for physical replication")));
+
+		/*
+		 * We don't need to verify the slot's restart_lsn here; instead we
+		 * rely on the caller requesting the starting point to use.  If the
+		 * WAL segment doesn't exist, we'll fail later.
+		 */
 	}
 
 	/*
@@ -1134,6 +1140,13 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 
 	(void) ReplicationSlotAcquire(cmd->slotname, SAB_Error);
 
+	if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("cannot read from replication slot \"%s\"",
+						cmd->slotname),
+				 errdetail("This slot has never previously reserved WAL, or has been invalidated")));
+
 	/*
 	 * Force a disconnect, so that the decoding code doesn't need to care
 	 * about an eventual switch from running in recovery, to running in a
@@ -1159,7 +1172,6 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 							  WalSndPrepareWrite, WalSndWriteData,
 							  WalSndUpdateProgress);
 
-
 	WalSndSetState(WALSNDSTATE_CATCHUP);
 
 	/* Send a CopyBothResponse message, and start streaming */
-- 
2.20.1

Reply via email to