Hi,

On 11/7/23 11:55 AM, Amit Kapila wrote:
On Tue, Nov 7, 2023 at 3:51 PM Drouvot, Bertrand
<bertranddrouvot...@gmail.com> wrote:

On 10/31/23 10:37 AM, shveta malik wrote:
On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand

Agree with your test case, but in my case I was not using pub/sub.

I was not clear, so when I said:

- create logical_slot1 on the primary (and don't start using it)

I meant don't start decoding from it (like using pg_recvlogical() or
pg_logical_slot_get_changes()).

By using pub/sub the "don't start using it" is not satisfied.

My test case is:

"
SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 
'test_decoding', false, true, true);
SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 
'test_decoding', false, true, true);
pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f -
"


Okay, I am able to reproduce it now. Thanks for clarification. I have
tried to change the algorithm as per suggestion by Amit in [1]

[1]: 
https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com

Thanks!


This is not full proof solution but optimization over first one. Now
in any sync-cycle, we take 2 attempts for slots-creation (if any slots
are available to be created). In first attempt, we do not wait
indefinitely on inactive slots, we wait only for a fixed amount of
time and if remote-slot is still behind, then we add that to the
pending list and move to the next slot. Once we are done with first
attempt, in second attempt, we go for the pending ones and now we wait
on each of them until the primary catches up.

Aren't we "just" postponing the "issue"? I mean if there is really no activity
on, say, the first created slot, then once we move to the second attempt then 
any newly
created slot from that time would wait to be synced forever, no?


We have to wait at some point in time for such inactive slots and the
same is true even for manually created slots on standby. Do you have
any better ideas to deal with it?


What about:

- get rid of the second attempt and the pending_slot_list
- keep the wait_count and PrimaryCatchupWaitAttempt logic

so basically, get rid of:

   /*
    * Now sync the pending slots which were failed to be created in first
    * attempt.
    */
   foreach(cell, pending_slot_list)
   {
       RemoteSlot *remote_slot = (RemoteSlot *) lfirst(cell);

       /* Wait until the primary server catches up */
       PrimaryCatchupWaitAttempt = 0;

       synchronize_one_slot(wrconn, remote_slot, NULL);
   }

and the pending_slot_list list.

That way, for each slot that have not been created and synced yet:

- it will be created on the standby
- we will wait up to PrimaryCatchupWaitAttempt attempts
- the slot will be synced or removed on/from the standby

That way an inactive slot on the primary would not "block"
any other slots on the standby.

By "created" here I mean calling ReplicationSlotCreate() (not to be confused
with emitting "ereport(LOG, errmsg("created slot \"%s\" locally", 
remote_slot->name)); "
which is confusing as mentioned up-thread).

The problem I can see with this proposal is that the "sync" window waiting
for slot activity on the primary is "only" during the PrimaryCatchupWaitAttempt
attempts (as the slot will be dropped/recreated).

If we think this window is too short we could:

- increase it
or
- don't drop the slot once created (even if there is no activity
on the primary during PrimaryCatchupWaitAttempt attempts) so that
the next loop of attempts will compare with "older" LSN/xmin (as compare to
dropping and re-creating the slot). That way the window would be since the
initial slot creation.

Thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


Reply via email to