On Friday, March 1, 2024 12:23 PM Peter Smith <[email protected]> wrote: > > Here are some review comments for v102-0001. > > ====== > doc/src/sgml/config.sgml > > 1. > + <para> > + Lists the streaming replication standby server slot names that > logical > + WAL sender processes will wait for. Logical WAL sender processes will > + send decoded changes to plugins only after the specified replication > + slots confirm receiving WAL. This guarantees that logical replication > + slots with failover enabled do not consume changes until those > changes > + are received and flushed to corresponding physical standbys. If a > + logical replication connection is meant to switch to a physical > standby > + after the standby is promoted, the physical replication slot for the > + standby should be listed here. Note that logical replication will not > + proceed if the slots specified in the standby_slot_names do > not exist or > + are invalidated. > + </para> > > Should this also mention the effect this GUC has on those 2 SQL functions? > E.g. > Commit message says: > > Additionally, The SQL functions pg_logical_slot_get_changes and > pg_replication_slot_advance are modified to wait for the replication slots > mentioned in 'standby_slot_names' to catch up before returning.
Added.
>
> ======
> src/backend/replication/slot.c
>
> 2. validate_standby_slots
>
> + else if (!ReplicationSlotCtl)
> + {
> + /*
> + * We cannot validate the replication slot if the replication slots'
> + * data has not been initialized. This is ok as we will validate the
> + * specified slot when waiting for them to catch up. See
> + * StandbySlotsHaveCaughtup for details.
> + */
> + }
> + else
> + {
> + /*
> + * If the replication slots' data have been initialized, verify if the
> + * specified slots exist and are logical slots.
> + */
> + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
>
> IMO that 2nd comment does not need to say "If the replication slots'
> data have been initialized," because that is implicit from the if/else.
Removed.
>
> ~~~
>
> 3. GetStandbySlotList
>
> +List *
> +GetStandbySlotList(void)
> +{
> + if (RecoveryInProgress())
> + return NIL;
> + else
> + return standby_slot_names_list;
> +}
>
> The 'else' is not needed. IMO is better without it, but YMMV.
Removed.
>
> ~~~
>
> 4. StandbySlotsHaveCaughtup
>
> +/*
> + * Return true if the slots specified in standby_slot_names have caught
> +up to
> + * the given WAL location, false otherwise.
> + *
> + * The elevel parameter determines the error level used for logging
> +messages
> + * related to slots that do not exist, are invalidated, or are inactive.
> + */
> +bool
> +StandbySlotsHaveCaughtup(XLogRecPtr wait_for_lsn, int elevel)
>
> /determines/specifies/
Changed.
>
> ~
>
> 5.
> + /*
> + * Don't need to wait for the standby to catch up if there is no value
> + in
> + * standby_slot_names.
> + */
> + if (!standby_slot_names_list)
> + return true;
> +
> + /*
> + * If we are on a standby server, we do not need to wait for the
> + standby to
> + * catch up since we do not support syncing slots to cascading standbys.
> + */
> + if (RecoveryInProgress())
> + return true;
> +
> + /*
> + * Return true if all the standbys have already caught up to the passed
> + in
> + * WAL localtion.
> + */
> + if (!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) &&
> + standby_slot_oldest_flush_lsn >= wait_for_lsn) return true;
>
>
> 5a.
> I felt all these comments should be worded in a consistent way like "Don't
> need
> to wait ..."
>
> e.g.
> 1. Don't need to wait for the standbys to catch up if there is no value in
> 'standby_slot_names'.
> 2. Don't need to wait for the standbys to catch up if we are on a standby
> server,
> since we do not support syncing slots to cascading standbys.
> 3. Don't need to wait for the standbys to catch up if they are already beyond
> the specified WAL location.
Changed.
>
> ~
>
> 5b.
> typo
> /WAL localtion/WAL location/
>
Fixed.
> ~~~
>
> 6.
> + if (!slot)
> + {
> + /*
> + * It may happen that the slot specified in standby_slot_names GUC
> + * value is dropped, so let's skip over it.
> + */
> + ereport(elevel,
> + errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("replication slot \"%s\" specified in parameter %s does not exist",
> + name, "standby_slot_names"));
> + continue;
> + }
>
> Is "is dropped" strictly the only reason? IIUC another reason is the slot
> maybe
> just did not even exist in the first place but it was not detected before now
> because inititial validation was skipped.
Changed the comment.
>
> ~~~
>
> 7.
> + /*
> + * Return false if not all the standbys have caught up to the specified
> + WAL
> + * location.
> + */
> + if (caught_up_slot_num != list_length(standby_slot_names_list))
> + return false;
>
> Somehow it seems complicated to have a counter of the slots as you process
> then compare that counter to the list_length to determine if one of them was
> skipped.
>
> Probably simpler just to set a 'skipped' flag set whenever you do
> 'continue'...
>
I prefer the current style because we need to set skipped =true in
multiple places which doesn't seem better to me.
> ======
> src/backend/replication/walsender.c
>
> 8.
> +/*
> + * Returns true if there are not enough WALs to be read, or if not all
> +standbys
> + * have caught up to the flushed position when failover_slot is true;
> + * otherwise, returns false.
> + *
> + * Set prioritize_stop to true to skip waiting for WALs if the shutdown
> +signal
> + * is received.
> + *
> + * Set failover_slot to true if the current acquired slot is a failover
> +enabled
> + * slot and we are streaming.
> + *
> + * If returning true, the function sets the appropriate wait event in
> + * wait_event; otherwise, wait_event is set to 0.
> + */
> +static bool
> +NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, bool
> +prioritize_stop, bool failover_slot,
> + uint32 *wait_event)
>
> 8a.
> /Set prioritize_stop to true/Specify prioritize_stop=true/
>
> /Set failover_slot to true/Specify failover_slot=true/
This function has been refactored a bit.
>
> ~
>
> 8b.
> Aren't the static function names typically snake_case?
I think the current name style is more consistent with the other functions in
walsender.c.
>
> ~~~
>
> 9.
> + /*
> + * Check if we need to wait for WALs to be flushed to disk. We don't
> + need
> + * to wait for WALs after receiving the shutdown signal unless the
> + * wait_for_wal_on_stop is true.
> + */
> + if (target_lsn > flushed_lsn && !(prioritize_stop && got_STOPPING))
> + *wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL;
>
> The comment says 'wait_for_wal_on_stop' but the code says 'prioritize_stop'
> (??)
This has been removed in last version.
> ~~~
>
> 10.
> + /*
> + * Check if we need to wait for WALs to be flushed to disk. We don't
> + need
> + * to wait for WALs after receiving the shutdown signal unless the
> + * wait_for_wal_on_stop is true.
> + */
> + if (target_lsn > flushed_lsn && !(prioritize_stop && got_STOPPING))
> + *wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL;
> +
> + /*
> + * Check if the standby slots have caught up to the flushed position.
> + It is
> + * good to wait up to RecentFlushPtr and then let it send the changes
> + to
> + * logical subscribers one by one which are already covered in
> + * RecentFlushPtr without needing to wait on every change for standby
> + * confirmation. Note that after receiving the shutdown signal, an
> + ERROR is
> + * reported if any slots are dropped, invalidated, or inactive. This
> + * measure is taken to prevent the walsender from waiting indefinitely.
> + */
> + else if (failover_slot && !StandbySlotsHaveCaughtup(flushed_lsn,
> + elevel)) *wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION;
> + else
> + return false;
> +
> + return true;
>
> This if/else/else seems overly difficult to read. IMO it will be easier if
> written
> like:
>
> SUGGESTION
> <comment>
> if (target_lsn > flushed_lsn && !(prioritize_stop && got_STOPPING)) {
> *wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL;
> return true;
> }
>
> <comment>
> if (failover_slot && !StandbySlotsHaveCaughtup(flushed_lsn, elevel)) {
> *wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION;
> return true;
> }
>
> return false;
Changed.
Attach the V103 patch set which addressed above comments and
Sawada-san's comment[1].
Apart from the comments, the code in WalSndWaitForWal was refactored
a bit to make it neater. Thanks Shveta for helping writing the code and doc.
[1]
https://www.postgresql.org/message-id/CAD21AoBhty79zHgXYMNHH1KqO2VtmjRi22QPmYP2yaHC9WFVdw%40mail.gmail.com
Best Regards,
Hou zj
v103-0002-Document-the-steps-to-check-if-the-standby-is-r.patch
Description: v103-0002-Document-the-steps-to-check-if-the-standby-is-r.patch
v103-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch
Description: v103-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch
