Thank you for committing this. At Tue, 7 Apr 2020 18:45:22 -0400, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote in > On 2020-Apr-07, Kyotaro Horiguchi wrote: > > > > Mmm. Couldn't we have a new member 'invalidated' in ReplicationSlot? > > > > I did that in the attached. The invalidated is shared-but-not-saved > > member of a slot and initialized to false then irreversibly changed to > > true when the slot loses required segment. > > > > It is checked by the new function CheckReplicationSlotInvalidated() at > > acquireing a slot and at updating slot by standby reply message. This > > change stops walsender without explicitly killing but I didn't remove > > that code. > > This change didn't work well with my proposed change to make > checkpointer acquire slots before marking them invalid. When I > incorporated your patch in the last version I posted yesterday, there > was a problem that when checkpointer attempted to acquire the slot, it > would fail with "the slot is invalidated"; also if you try to drop the > slot, it would obviously fail. I think it would work to remove the > SlotIsInvalidated check from the Acquire routine, and instead move it to > the routines that need it (ie. not the InvalidateObsolete one, and also > not the routine to drop slots). > > I pushed version 26, with a few further adjustments. > > I think what we have now is sufficient, but if you want to attempt this > "invalidated" flag on top of what I pushed, be my guest.
I don't think the invalidation flag is essential but it can prevent unanticipated behavior, in other words, it makes us feel at ease:p After the current master/HEAD, the following steps causes assertion failure in xlogreader.c. P(ublisher) $ vi $PGDATA/postgresql.conf wal_level=logical max_slot_wal_keep_size=0 ^Z (start publisher and subscriber) P=> create table t(a int); P=> create publication p1 for table t; S=> create table t(a int); P=> create table tt(); drop table tt; select pg_switch_wal(); checkpoint; (publisher crashes) 2020-04-08 09:20:16.893 JST [9582] LOG: invalidating slot "s1" because its restart_lsn 0/1571770 exceeds max_slot_wal_keep_size 2020-04-08 09:20:16.897 JST [9496] LOG: database system is ready to accept connections 2020-04-08 09:20:21.472 JST [9597] LOG: starting logical decoding for slot "s1" 2020-04-08 09:20:21.472 JST [9597] DETAIL: Streaming transactions committing after 0/1571770, reading WAL from 0/0. TRAP: FailedAssertion("!XLogRecPtrIsInvalid(RecPtr)", File: "xlogreader.c", Line: 235) postgres: walsender horiguti [local] idle(ExceptionalCondition+0xa8)[0xaac4c1] postgres: walsender horiguti [local] idle(XLogBeginRead+0x30)[0x588dbf] postgres: walsender horiguti [local] idle[0x8c938b] postgres: walsender horiguti [local] idle(exec_replication_command+0x311)[0x8c9c75] postgres: walsender horiguti [local] idle(PostgresMain+0x79a)[0x92f091] postgres: walsender horiguti [local] idle[0x87eec3] postgres: walsender horiguti [local] idle[0x87e69a] postgres: walsender horiguti [local] idle[0x87abc2] postgres: walsender horiguti [local] idle(PostmasterMain+0x11cd)[0x87a48f] postgres: walsender horiguti [local] idle[0x7852cb] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7fc190958873] postgres: walsender horiguti [local] idle(_start+0x2e)[0x48169e] 2020-04-08 09:20:22.255 JST [9496] LOG: server process (PID 9597) was terminated by signal 6: Aborted 2020-04-08 09:20:22.255 JST [9496] LOG: terminating any other active server processes 2020-04-08 09:20:22.256 JST [9593] WARNING: terminating connection because of crash of another server process 2020-04-08 09:20:22.256 JST [9593] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. I will look at it. On the other hand, physical replication doesn't break by invlidation. Primary: postgres.conf max_slot_wal_keep_size=0 Standby: postgres.conf primary_conninfo='connect to master' primary_slot_name='x1' (start the primary) P=> select pg_create_physical_replication_slot('x1'); (start the standby) S=> create table tt(); drop table tt; select pg_switch_wal(); checkpoint; (primary log) 2020-04-08 09:35:09.719 JST [10064] LOG: terminating walsender 10076 because replication slot "x1" is too far behind 2020-04-08 09:35:09.719 JST [10076] FATAL: terminating connection due to administrator command 2020-04-08 09:35:09.720 JST [10064] LOG: invalidating slot "x1" because its restart_lsn 0/B9F2000 exceeds max_slot_wal_keep_size (standby) [10075] 2020-04-08 09:35:09.723 JST FATAL: could not receive data from WAL stream: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. [10101] 2020-04-08 09:35:09.734 JST LOG: started streaming WAL from primary at 0/C000000 on timeline 1 Doesn't harm but something's strange. I'll look it, too. regards. -- Kyotaro Horiguchi NTT Open Source Software Center