On Tue, Jan 30, 2024 at 9:53 PM shveta malik <shveta.ma...@gmail.com> wrote: > > On Tue, Jan 30, 2024 at 4:06 PM shveta malik <shveta.ma...@gmail.com> wrote: > > > > PFA v73-0001 which addresses the above comments. Other patches will be > > rebased and posted after pushing this one. > > Since v73-0001 is pushed, PFA rest of the patches. Changes are: > > 1) Rebased the patches. > 2) Ran pg_indent on all. > 3) patch001: Updated logicaldecoding.sgml for dbname requirement in > primary_conninfo for slot-synchronization. >
Thank you for updating the patches. As for the slotsync worker patch, is there any reason why 0001, 0002, and 0004 patches are still separated? Beside, here are some comments on v74 0001, 0002, and 0004 patches: --- +static char * +wait_for_valid_params_and_get_dbname(void) +{ + char *dbname; + int rc; + + /* Sanity check. */ + Assert(enable_syncslot); + + for (;;) + { + if (validate_parameters_and_get_dbname(&dbname)) + break; + ereport(LOG, errmsg("skipping slot synchronization")); + + ProcessSlotSyncInterrupts(NULL); When reading this function, I expected that the slotsync worker would resume working once the parameters became valid, but it was not correct. For example, if I changed hot_standby_feedback from off to on, the slotsync worker reads the config file, exits, and then restarts. Given that the slotsync worker ends up exiting on parameter changes anyway, why do we want to have it wait for parameters to become valid? IIUC even if the slotsync worker exits when a parameter is not valid, it restarts at some intervals. --- +bool +SlotSyncWorkerCanRestart(void) +{ +#define SLOTSYNC_RESTART_INTERVAL_SEC 10 + IIUC depending on how busy the postmaster is and the timing, the user could wait for 1 min to re-launch the slotsync worker. But I think the user might want to re-launch the slotsync worker more quickly for example when the slotsync worker restarts due to parameter changes. IIUC SloSyncWorkerCanRestart() doesn't consider the fact that the slotsync worker previously exited with 0 or 1. --- + /* We are a normal standby */ + valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); + Assert(!isnull); What do you mean by "normal standby"? --- + appendStringInfo(&cmd, + "SELECT pg_is_in_recovery(), count(*) = 1" + " FROM pg_replication_slots" + " WHERE slot_type='physical' AND slot_name=%s", + quote_literal_cstr(PrimarySlotName)); I think we need to make "pg_replication_slots" schema-qualified. --- + errdetail("The primary server slot \"%s\" specified by" + " \"%s\" is not valid.", + PrimarySlotName, "primary_slot_name")); and + errmsg("slot sync worker will shutdown because" + " %s is disabled", "enable_syncslot")); It's better to write it in one line for better greppability. --- When I dropped a database on the primary that has a failover slot, I got the following logs on the standby: 2024-01-31 17:25:21.750 JST [1103933] FATAL: replication slot "s" is active for PID 1103935 2024-01-31 17:25:21.750 JST [1103933] CONTEXT: WAL redo at 0/3020D20 for Database/DROP: dir 1663/16384 2024-01-31 17:25:21.751 JST [1103930] LOG: startup process (PID 1103933) exited with exit code 1 It seems that because the slotsync worker created the slot on the standby, the slot's active_pid is still valid. That is why the startup process could not drop the slot. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com