Re: Synchronizing slots from primary to standby

shveta malik Tue, 03 Oct 2023 22:54:44 -0700

On Mon, Oct 2, 2023 at 4:29 PM Hayato Kuroda (Fujitsu)
<[email protected]> wrote:
>
> Dear Shveta,
>
> Thank you for updating the patch!
>
> I found another ERROR due to the slot removal. Is this a real issue?
>
> 1. applied add_sleep.txt, which emulates the case where the tablesync worker
>    gets stuck and the primary crashes during the initial sync.
> 2. executed test_0925_v2.sh (You attached in [1])
> 3. secondary could not start the logical replication because the slot was not
>    created (log files were also attached).
>
>
> Here is my analysis. The cause is that the slotsync worker aborts the slot
> creation on the secondary server because the restart_lsn of the secondary is
> ahead of the primary's. IIUC this can occur when tablesync workers finish the
> initial copy before walsenders stream changes. In this case, the relstate of
> the worker is set to SUBREL_STATE_CATCHUP and the apply worker waits till the
> relation becomes SUBREL_STATE_SYNCDONE. From here the slot on the primary will
> not be updated until the relation has caught up. If some changes arrive and
> the primary crashes at that time, the slotsync worker will abort the slot
> creation.
>


Kuroda-San, we need to let slot creation on the standby finish before we
can expect it to support logical replication on failover. In the
current case, as you stated, the slot creation itself is aborted, and
thus the slot cannot support logical replication later. We are currently
exploring ways for the slot-sync workers to advance remote_lsn on the
primary internally, in order to accelerate slot creation on the standby
in cases where it is stuck because the primary's restart_lsn lags behind
the standby's restart_lsn. Until then, the way to proceed for testing is
to run a workload on the primary in such cases in order to accelerate
slot creation.
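
To make the condition concrete, here is a minimal sketch of the check
described above. It is not code from the patch: lsn_t, parse_lsn() and
standby_slot_can_be_created() are hypothetical names used only for
illustration, and treating equal positions as acceptable is an
assumption. The only facts relied on are that an LSN is a 64-bit WAL
position and that its text form is two hexadecimal halves separated by
a slash.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a WAL position: a 64-bit byte offset. */
typedef uint64_t lsn_t;

/*
 * Parse the textual "hi/lo" hexadecimal form of an LSN, e.g. "0/3000060".
 * Returns false if the input is malformed or has trailing characters.
 */
static bool
parse_lsn(const char *text, lsn_t *out)
{
    unsigned int hi;
    unsigned int lo;
    char         trailing;

    /* Exactly two conversions must succeed; a third means trailing junk. */
    if (sscanf(text, "%X/%X%c", &hi, &lo, &trailing) != 2)
        return false;

    *out = ((lsn_t) hi << 32) | lo;
    return true;
}

/*
 * Assumed rule, as described in this thread: slot creation on the standby
 * cannot complete while the primary's restart_lsn is behind the standby's.
 */
static bool
standby_slot_can_be_created(lsn_t primary_restart_lsn,
                            lsn_t standby_restart_lsn)
{
    return primary_restart_lsn >= standby_restart_lsn;
}
```

With the primary's slot at 0/3000060 and the standby at 1/0, the check
returns false. Running a workload on the primary lets the primary's
restart_lsn move forward until the check can pass.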

thanks
Shveta
