Re: speed up a logical replica setup

Amit Kapila Wed, 22 May 2024 04:20:21 -0700

On Mon, May 20, 2024 at 4:30 PM Shlok Kyal <[email protected]> wrote:
> >
> > I was trying to test this utility when 'sync_replication_slots' is on
> > and it gets in an ERROR loop [1] and never finishes. Please find the
> > postgresql.auto used on the standby attached. I think if the standby
> > has enabled sync_slots, you need to pass dbname in
> > GenerateRecoveryConfig(). I couldn't test it further but I wonder if
> > there are already synced slots on the standby (either due to
> > 'sync_replication_slots' or users have used
> > pg_sync_replication_slots() before invoking pg_createsubscriber),
> > those would be retained as it is on new subscriber and lead to
> > unnecessary WAL retention and dead rows.
> >
> > [1]
> > 2024-04-30 11:50:43.239 IST [12536] LOG:  slot sync worker started
> > 2024-04-30 11:50:43.247 IST [12536] ERROR:  slot synchronization
> > requires dbname to be specified in primary_conninfo
>
> Hi,
>
> I tested the scenario posted by Amit in [1], in which retaining synced
> slots leads to unnecessary WAL retention and an ERROR. This is raised as
> the second open point in [2].
> The steps to reproduce the issue:
> (1) Setup physical replication with sync slot feature turned on by
> setting sync_replication_slots = 'true' or using
> pg_sync_replication_slots() on the standby node.
> For physical replication setup, run pg_basebackup with -R  and -d option.
> (2) Create a logical replication slot on primary node with failover
> option as true. A corresponding slot is created on standby as part of
> sync slot feature.
> (3) Run pg_createsubscriber on standby node.
> (4) On checking the replication slots on the standby node, I noticed
> that the logical slots created in step 2 are retained.
> I have attached the script to reproduce the issue.
>
> I and Kuroda-san worked to resolve open points. Here are patches to
> solve the second and third point in [2].
> Patches proposed by Euler are also attached just in case, but they
> were not modified.
>
> v2-0001: not changed
>


Shouldn't we modify it as per the suggestion given in the email [1]? I
am wondering if we can entirely get rid of the checks against the
primary and simply rely on recovery_timeout while we keep checking
server_is_in_recovery(). If so, we can modify the test to use a
non-default recovery_timeout (say 180s or something similar if we have
used it at any other place). As an additional check we can ensure that
consistent_lsn is present on the standby.
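
To make the idea concrete, here is a minimal sketch in Python (not the
actual C code in pg_createsubscriber) of a wait loop that is bounded only
by a timeout and keeps polling the recovery state. The names
wait_for_recovery_end, is_in_recovery, poll_interval, clock and sleep are
hypothetical; is_in_recovery stands in for server_is_in_recovery(), and
the clock and sleep functions are injectable only so that the loop can be
exercised without a real server.

```python
import time
from typing import Callable


def wait_for_recovery_end(
    is_in_recovery: Callable[[], bool],
    recovery_timeout: float,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the server leaves recovery or the timeout expires.

    Returns True if recovery ended before the deadline, False otherwise.
    Assumes recovery_timeout is a positive number of seconds.
    """
    deadline = clock() + recovery_timeout
    while True:
        # Stand-in for server_is_in_recovery(); no check on the primary.
        if not is_in_recovery():
            return True
        # Give up only when the time budget is spent, not after N tries.
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

With this shape, a slow machine only needs a larger recovery_timeout
(for example 180s in the test); nothing else in the loop depends on how
fast the standby is.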

> v2-0002: not changed
>

We have added more tries to see if the primary_slot_name becomes
active but I think it is still fragile because it is possible on slow
machines that the required slot didn't become active even after more
retries. I have raised the same comment previously [2] and asked an
additional question but didn't get any response.
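
The fragility can be shown with a small simulation, again only a sketch
in Python under stated assumptions and not the code from the patch. The
functions slot_became_active and simulate, and the parameters
activation_delay_s, attempts and interval_s, are hypothetical; the
simulation models a slot that turns active a fixed number of seconds
after the check starts.

```python
from typing import Callable


def slot_became_active(
    check_active: Callable[[], bool],
    attempts: int,
    pause: Callable[[], None],
) -> bool:
    """Poll check_active() at most `attempts` times, pausing between polls."""
    for _ in range(attempts):
        if check_active():
            return True
        pause()
    return False


def simulate(activation_delay_s: float, attempts: int,
             interval_s: float = 1.0) -> bool:
    """Simulate a slot that becomes active activation_delay_s after start."""
    now = 0.0

    def check_active() -> bool:
        # The slot is active once the simulated time reaches the delay.
        return now >= activation_delay_s

    def pause() -> None:
        # Advance simulated time instead of really sleeping.
        nonlocal now
        now += interval_s

    return slot_became_active(check_active, attempts, pause)
```

In this model simulate(2, 5) succeeds, while simulate(30, 5) gives up
even though the slot would have become active later. Raising the number
of attempts only moves the threshold; any fixed count can still be too
small on a sufficiently slow machine.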

[1] - https://www.postgresql.org/message-id/CAA4eK1JJq_ER6Kq_H%3DjKHR75QPRd8y9_D%3DRtYw%3DaPYKMfqLi9A%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1LT3Z13Dg6p4Z%2B4adO_EY-ow5CmWfikEmBfL%3DeVrm8CPw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
