On Mon, Oct 2, 2023 at 4:29 PM Hayato Kuroda (Fujitsu) <kuroda.hay...@fujitsu.com> wrote:
>
> Dear Shveta,
>
> Thank you for updating the patch!
>
> I found another ERROR due to the slot removal. Is this a real issue?
>
> 1. applied add_sleep.txt, which emulates the case where the tablesync worker
>    gets stuck and the primary crashes during the initial sync.
> 2. executed test_0925_v2.sh (which you attached in [1])
> 3. the secondary could not start logical replication because the slot was not
>    created (log files are also attached).
>
> Here is my analysis. The cause is that the slotsync worker aborts the slot
> creation on the secondary server because the restart_lsn of the secondary is
> ahead of the primary's.
> IIUC this can occur when a tablesync worker finishes the initial copy before
> the walsender streams changes. In this case, the relstate of the worker is set
> to SUBREL_STATE_CATCHUP and the apply worker waits until the relation becomes
> SUBREL_STATE_SYNCDONE. From here the slot on the primary will not be updated
> until the relation has caught up. If some changes arrive and the primary
> crashes at that time, the slotsync worker will abort the slot creation.
>
Kuroda-San, we need to let slot creation on the standby finish before we can expect it to support logical replication on failover. In the current case, as you stated, the slot creation itself is aborted, and thus the slot cannot support logical replication later.

We are currently exploring ways for the slot-sync workers to internally advance remote_lsn on the primary, in order to accelerate slot creation on the standby in cases where it is stuck because the primary's restart_lsn lags behind the standby's restart_lsn. Until then, the way to proceed for testing is to execute a workload on the primary in such cases, which accelerates slot creation.

thanks
Shveta
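
PS: To illustrate the condition discussed above, here is a minimal sketch in Python. It is not the patch's code: SlotState and can_finish_slot_creation are hypothetical names, and the comparison is only my reading of the rule described in this thread (slot creation on the standby cannot finish while the primary's restart_lsn is behind the standby's). Only the textual LSN form "X/Y" (two hexadecimal numbers) is taken from PostgreSQL itself.

```python
from dataclasses import dataclass


def parse_lsn(text: str) -> int:
    """Convert PostgreSQL's textual LSN form 'X/Y' (two hex numbers) to an integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass
class SlotState:
    """Hypothetical stand-in for the part of a slot's state that matters here."""
    restart_lsn: int


def can_finish_slot_creation(primary: SlotState, standby: SlotState) -> bool:
    # Assumed rule, modeled on the discussion in this thread: creation on the
    # standby can complete only once the primary's restart_lsn has reached
    # the standby's restart_lsn.
    return primary.restart_lsn >= standby.restart_lsn


# Primary's slot is behind the standby's: creation stays stuck.
primary = SlotState(parse_lsn("0/3000000"))
standby = SlotState(parse_lsn("0/3000060"))
print(can_finish_slot_creation(primary, standby))  # False

# Running a workload on the primary moves its restart_lsn forward.
primary.restart_lsn = parse_lsn("0/3000100")
print(can_finish_slot_creation(primary, standby))  # True
```

The LSN values are made up for the example. The second half mirrors the testing workaround: once activity on the primary moves its restart_lsn past the standby's, the check passes.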