On Wed, Apr 3, 2024 at 12:20 PM Bertrand Drouvot <bertranddrouvot...@gmail.com> wrote: > > > Please find the attached v31 patches implementing the above idea: > > Some comments related to v31-0001: > > === testing the behavior > > T1 === > > > - synced slots get their on inactive_since just like any other slot > > It behaves as described. > > T2 === > > > - synced slots inactive_since is set to current timestamp after the > > standby gets promoted to help inactive_since interpret correctly just > > like any other slot. > > It behaves as described.
Thanks for testing. > CR1 === > > + <structfield>inactive_since</structfield> value will get updated > + after every synchronization > > indicates the last synchronization time? (I think that after every > synchronization > could lead to confusion). Done. > CR2 === > > + /* > + * Set the time since the slot has become inactive > after shutting > + * down slot sync machinery. This helps correctly > interpret the > + * time if the standby gets promoted without a > restart. > + */ > > It looks to me that this comment is not at the right place because there is > nothing after the comment that indicates that we shutdown the "slot sync > machinery". > > Maybe a better place is before the function definition and mention that this > is > currently called when we shutdown the "slot sync machinery"? Done. > CR3 === > > + * We get the current time beforehand and only once > to avoid > + * system calls overhead while holding the lock. > > s/avoid system calls overhead while holding the lock/avoid system calls while > holding the spinlock/? Done. > CR4 === > > + * Set the time since the slot has become inactive. We get the current > + * time beforehand to avoid system call overhead while holding the > lock > > Same. Done. > CR5 === > > + # Check that the captured time is sane > + if (defined $reference_time) > + { > > s/Check that the captured time is sane/Check that the inactive_since is sane/? > > Sorry if some of those comments could have been done while I did review > v29-0001. Done. On Wed, Apr 3, 2024 at 2:58 PM shveta malik <shveta.ma...@gmail.com> wrote: > > Thanks for the patches, please find few comments: > > v31-001: > > 1) > system-views.sgml: > value will get updated after every synchronization from the > corresponding remote slot on the primary. > > --This is confusing. It will be good to rephrase it. Done as per Bertrand's suggestion. > 2) > update_synced_slots_inactive_since() > > --May be, we should mention in the header that this function is called > only during promotion. Done as per Bertrand's suggestion. > 3) 040_standby_failover_slots_sync.pl: > We capture inactive_since_on_primary when we do this for the first time at > #175 > ALTER SUBSCRIPTION regress_mysub1 DISABLE" > > But we again recreate the sub and disable it at line #280. > Do you think we shall get inactive_since_on_primary again here, to be > compared with inactive_since_on_new_primary later? Hm. Done that. Recapturing both slot_creation_time_on_primary and inactive_since_on_primary before and after CREATE SUBSCRIPTION creates the slot again on the primary/publisher. On Wed, Apr 3, 2024 at 3:32 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > CR2 === > > > > + /* > > + * Set the time since the slot has become inactive > > after shutting > > + * down slot sync machinery. This helps correctly > > interpret the > > + * time if the standby gets promoted without a > > restart. > > + */ > > > > It looks to me that this comment is not at the right place because there is > > nothing after the comment that indicates that we shutdown the "slot sync > > machinery". > > > > Maybe a better place is before the function definition and mention that > > this is > > currently called when we shutdown the "slot sync machinery"? > > > Won't it be better to have an assert for SlotSyncCtx->pid? IIRC, we > have some existing issues where we don't ensure that no one is running > sync API before shutdown is complete but I think we can deal with that > separately and here we can still have an Assert. That can work to ensure the slot sync worker isn't running as SlotSyncCtx->pid gets updated only for the slot sync worker. I added this assertion for now. We need to ensure (in a separate patch and thread) there is no backend acquiring it and performing sync while the slot sync worker is shutting down. Otherwise, some of the slots can get resynced and some are not while we are shutting down the slot sync worker as part of the standby promotion which might leave the slots in an inconsistent state. > > CR3 === > > > > + * We get the current time beforehand and only once > > to avoid > > + * system calls overhead while holding the lock. > > > > s/avoid system calls overhead while holding the lock/avoid system calls > > while holding the spinlock/? > > > Is it valid to say that there is overhead of this call while holding > spinlock? Because I don't think at the time of promotion we expect any > other concurrent slot activity. The first reason seems good enough. No slot activity but why GetCurrentTimestamp needs to be called every time in a loop. > One other observation: > --- a/src/backend/replication/slot.c > +++ b/src/backend/replication/slot.c > @@ -42,6 +42,7 @@ > #include "access/transam.h" > #include "access/xlog_internal.h" > #include "access/xlogrecovery.h" > +#include "access/xlogutils.h" > > Is there a reason for this inclusion? I don't see any change which > should need this one. Not anymore. It was earlier needed for using the InRecovery flag in the then approach. On Wed, Apr 3, 2024 at 4:19 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > 3) 040_standby_failover_slots_sync.pl: > > We capture inactive_since_on_primary when we do this for the first time at > > #175 > > ALTER SUBSCRIPTION regress_mysub1 DISABLE" > > > > But we again recreate the sub and disable it at line #280. > > Do you think we shall get inactive_since_on_primary again here, to be > > compared with inactive_since_on_new_primary later? > > > > I think so. Modified this to recapture the times before and after the slot gets recreated. > Few additional comments on tests: > 1. > +is( $standby1->safe_psql( > + 'postgres', > + "SELECT '$inactive_since_on_primary'::timestamptz < > '$inactive_since_on_standby'::timestamptz AND > + '$inactive_since_on_standby'::timestamptz < '$slot_sync_time'::timestamptz;" > > Shall we do <= check as we are doing in the main function > get_slot_inactive_since_value as the time duration is less so it can > be the same as well? Similarly, please check other tests. I get you. If the tests are so fast that losing a bit of precision might cause tests to fail. So, I'v added equality check for all the tests. > 2. > +=item $node->get_slot_inactive_since_value(self, slot_name, reference_time) > + > +Get inactive_since column value for a given replication slot validating it > +against optional reference time. > + > +=cut > + > +sub get_slot_inactive_since_value > > I see that all callers validate against reference time. It is better > to name it validate_slot_inactive_since rather than using get_* as the > main purpose is to validate the passed value. Existing callers yes. Also, I've removed the reference time as an optional parameter. Per an offlist chat with Amit, I've added the following note in synchronize_one_slot: @@ -584,6 +585,11 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid) * overwriting 'invalidated' flag to remote_slot's value. See * InvalidatePossiblyObsoleteSlot() where it invalidates slot directly * if the slot is not acquired by other processes. + * + * XXX: If it ever turns out that slot acquire/release is costly for + * cases when none of the slot property is changed then we can do a + * pre-check to ensure that at least one of the slot property is + * changed before acquiring the slot. */ ReplicationSlotAcquire(remote_slot->name, true); Please find the attached v32-0001 patch with the above review comments addressed. I'm working on review comments for 0002. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
v32-0001-Allow-synced-slots-to-have-their-own-inactive_si.patch
Description: Binary data