On 2020-Apr-08, Kyotaro Horiguchi wrote: > At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi > <horikyota....@gmail.com> wrote in > > Just avoiding starting replication when restart_lsn is invalid is > sufficient (the attached, which is equivalent to a part of what the > invalidated flag did). I thing that the error message needs a Hint but > it looks on the subscriber side as: > > [22086] 2020-04-08 10:35:04.188 JST ERROR: could not receive data from WAL > stream: ERROR: replication slot "s1" is invalidated > HINT: The slot exceeds the limit by max_slot_wal_keep_size. > > I don't think it is not clean.. Perhaps the subscriber should remove > the trailing line of the message from the publisher?
Thanks for the fix! I propose two changes: 1. reword the error like this: ERROR: replication slot "regression_slot3" cannot be advanced DETAIL: This slot has never previously reserved WAL, or has been invalidated 2. use the same error in one other place, to wit pg_logical_slot_get_changes() and pg_replication_slot_advance(). I made the DETAIL part the same in all places, but the ERROR line is adjusted to what each callsite is doing. I do think that this change in test_decoding is a bit unpleasant: -ERROR: cannot use physical replication slot for logical decoding +ERROR: cannot get changes from replication slot "repl" The test is -- check that we're detecting a streaming rep slot used for logical decoding SELECT 'init' FROM pg_create_physical_replication_slot('repl'); SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); > > On the other hand, physical replication doesn't break by invlidation. > > [...] > > If we don't mind that standby can reconnect after a walsender > termination due to the invalidation, we don't need to do something for > this. Restricting max_slot_wal_keep_size to be larger than a certain > threshold would reduce the chance we see that behavior. Yeah, I think you're referring to the fact that StartReplication() doesn't verify the restart_lsn of the slot; and if we do add a check, a few tests that rely on physical replication start to fail. This patch only adds a comment in that spot. But I don't (yet) know what the consequences of this are, or whether it can be fixed by setting a valid restart_lsn ahead of time. This test in pg_basebackup fails, for example: # Running: pg_basebackup -D /home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl -X stream -S slot1 pg_basebackup: error: could not send replication command "START_REPLICATION": ERROR: cannot read from replication slot "slot1" DETAIL: This slot has never previously reserved WAL, or has been invalidated pg_basebackup: error: child process exited with exit code 1 pg_basebackup: removing data directory "/home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl" not ok 95 - pg_basebackup -X stream with replication slot runs # Failed test 'pg_basebackup -X stream with replication slot runs' # at t/010_pg_basebackup.pl line 461. Anyway I think the current patch can be applied as is -- and if we want physical replication to have some other behavior, we can patch for that afterwards. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 74dcfd5f6384d9bbdb1d23e6d4d69fcf798c7cff Mon Sep 17 00:00:00 2001 From: Alvaro Herrera <alvhe...@alvh.no-ip.org> Date: Mon, 27 Apr 2020 17:32:28 -0400 Subject: [PATCH] check slot->restart_lsn in a couple of places --- contrib/test_decoding/expected/ddl.out | 3 ++- contrib/test_decoding/expected/slot.out | 3 ++- src/backend/replication/logical/logicalfuncs.c | 7 +++++++ src/backend/replication/slotfuncs.c | 4 +++- src/backend/replication/walsender.c | 14 +++++++++++++- 5 files changed, 27 insertions(+), 4 deletions(-) diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out index d79cd316b7..bbc8e445af 100644 --- a/contrib/test_decoding/expected/ddl.out +++ b/contrib/test_decoding/expected/ddl.out @@ -41,7 +41,8 @@ SELECT 'init' FROM pg_create_physical_replication_slot('repl'); (1 row) SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); -ERROR: cannot use physical replication slot for logical decoding +ERROR: cannot get changes from replication slot "repl" +DETAIL: This slot has never previously reserved WAL, or has been invalidated SELECT pg_drop_replication_slot('repl'); pg_drop_replication_slot -------------------------- diff --git a/contrib/test_decoding/expected/slot.out b/contrib/test_decoding/expected/slot.out index 1000171530..b448e881d6 100644 --- a/contrib/test_decoding/expected/slot.out +++ b/contrib/test_decoding/expected/slot.out @@ -143,7 +143,8 @@ SELECT slot_name FROM pg_create_physical_replication_slot('regression_slot3'); SELECT pg_replication_slot_advance('regression_slot3', '0/0'); -- invalid LSN ERROR: invalid target WAL LSN SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error -ERROR: cannot advance replication slot that has not previously reserved WAL +ERROR: replication slot "regression_slot3" cannot be advanced +DETAIL: This slot has never previously reserved WAL, or has been invalidated SELECT pg_drop_replication_slot('regression_slot3'); pg_drop_replication_slot -------------------------- diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c index f5384f1df8..8412e14fe1 100644 --- a/src/backend/replication/logical/logicalfuncs.c +++ b/src/backend/replication/logical/logicalfuncs.c @@ -227,6 +227,13 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin (void) ReplicationSlotAcquire(NameStr(*name), SAB_Error); + if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot get changes from replication slot \"%s\"", + NameStr(*name)), + errdetail("This slot has never previously reserved WAL, or has been invalidated"))); + PG_TRY(); { /* restart at slot's confirmed_flush */ diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index f776de3df7..1cfe2e5893 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -598,7 +598,9 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn)) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), - errmsg("cannot advance replication slot that has not previously reserved WAL"))); + errmsg("replication slot \"%s\" cannot be advanced", + NameStr(*slotname)), + errdetail("This slot has never previously reserved WAL, or has been invalidated"))); /* * Check if the slot is not moving backwards. Physical slots rely simply diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 6c87f6087c..8373f00e80 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -600,6 +600,12 @@ StartReplication(StartReplicationCmd *cmd) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("cannot use a logical replication slot for physical replication"))); + + /* + * We don't need to verify the slot's restart_lsn here; instead we + * rely on the caller requesting the starting point to use. If the + * WAL segment doesn't exist, we'll fail later. + */ } /* @@ -1134,6 +1140,13 @@ StartLogicalReplication(StartReplicationCmd *cmd) (void) ReplicationSlotAcquire(cmd->slotname, SAB_Error); + if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot read from replication slot \"%s\"", + cmd->slotname), + errdetail("This slot has never previously reserved WAL, or has been invalidated"))); + /* * Force a disconnect, so that the decoding code doesn't need to care * about an eventual switch from running in recovery, to running in a @@ -1159,7 +1172,6 @@ StartLogicalReplication(StartReplicationCmd *cmd) WalSndPrepareWrite, WalSndWriteData, WalSndUpdateProgress); - WalSndSetState(WALSNDSTATE_CATCHUP); /* Send a CopyBothResponse message, and start streaming */ -- 2.20.1