On Mon, May 20, 2024 at 4:30 PM Shlok Kyal <shlok.kyal....@gmail.com> wrote:
> >
> > I was trying to test this utility when 'sync_replication_slots' is on
> > and it gets in an ERROR loop [1] and never finishes. Please find the
> > postgresql.auto used on the standby attached. I think if the standby
> > has enabled sync_slots, you need to pass dbname in
> > GenerateRecoveryConfig(). I couldn't test it further but I wonder if
> > there are already synced slots on the standby (either due to
> > 'sync_replication_slots' or users have used
> > pg_sync_replication_slots() before invoking pg_createsubscriber),
> > those would be retained as it is on new subscriber and lead to
> > unnecessary WAL retention and dead rows.
> >
> > [1]
> > 2024-04-30 11:50:43.239 IST [12536] LOG: slot sync worker started
> > 2024-04-30 11:50:43.247 IST [12536] ERROR: slot synchronization
> > requires dbname to be specified in primary_conninfo
>
> Hi,
>
> I tested the scenario posted by Amit in [1], in which retaining synced
> slots lead to unnecessary WAL retention and ERROR. This is raised as
> the second open point in [2].
> The steps to reproduce the issue:
> (1) Setup physical replication with sync slot feature turned on by
> setting sync_replication_slots = 'true' or using
> pg_sync_replication_slots() on the standby node.
> For physical replication setup, run pg_basebackup with -R and -d option.
> (2) Create a logical replication slot on primary node with failover
> option as true. A corresponding slot is created on standby as part of
> sync slot feature.
> (3) Run pg_createsubscriber on standby node.
> (4) On Checking for the replication slot on standby node, I noticed
> that the logical slots created in step 2 are retained.
> I have attached the script to reproduce the issue.
>
> I and Kuroda-san worked to resolve open points. Here are patches to
> solve the second and third point in [2].
> Patches proposed by Euler are also attached just in case, but they
> were not modified.
>
> v2-0001: not changed
>
Shouldn't we modify it as per the suggestion given in the email [1]? I
am wondering if we can entirely get rid of the business of checking
the primary and simply rely on recovery_timeout while we keep checking
server_is_in_recovery(). If so, we can modify the test to use a
non-default recovery_timeout (say 180s, or something similar if we
have used it at any other place). As an additional check, we can
ensure that consistent_lsn is present on the standby.

> v2-0002: not changed
>

We have added more tries to see if the primary_slot_name becomes
active, but I think it is still fragile because on slow machines the
required slot may not become active even after more retries. I have
raised the same comment previously [2] and asked an additional
question but didn't get any response.

[1] - https://www.postgresql.org/message-id/CAA4eK1JJq_ER6Kq_H%3DjKHR75QPRd8y9_D%3DRtYw%3DaPYKMfqLi9A%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1LT3Z13Dg6p4Z%2B4adO_EY-ow5CmWfikEmBfL%3DeVrm8CPw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
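
P.S. To make the first suggestion concrete, here is a minimal sketch of the wait loop: poll server_is_in_recovery() and give up once recovery_timeout has elapsed, without looking at the primary at all. It is written in Python only to keep it short and runnable, and is not the pg_createsubscriber code. The function and parameter names, the injected sleep/clock callables, and the assumption of a positive timeout are all hypothetical. parse_lsn() relies only on the textual pg_lsn format (two hexadecimal halves separated by '/') and shows how the extra consistent_lsn check could be a plain integer comparison.

```python
import time
from typing import Callable


def parse_lsn(lsn: str) -> int:
    """Convert a textual pg_lsn such as '0/3000060' to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_reached(replay_lsn: str, consistent_lsn: str) -> bool:
    """True if the standby has replayed at least up to consistent_lsn."""
    return parse_lsn(replay_lsn) >= parse_lsn(consistent_lsn)


def wait_for_end_of_recovery(
    server_is_in_recovery: Callable[[], bool],
    recovery_timeout: float,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until recovery ends.

    Returns True once server_is_in_recovery() reports False, and False
    if recovery_timeout seconds pass first. This sketch assumes a
    positive recovery_timeout.
    """
    deadline = clock() + recovery_timeout
    while server_is_in_recovery():
        if clock() >= deadline:
            return False  # timed out while still in recovery
        sleep(poll_interval)
    return True
```

In a test, the caller would pass a generous timeout such as 180 and, once the function returns True, compare the standby's replay position against consistent_lsn with lsn_reached().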