Hi, Thanks for reviewing.
On Mon, Sep 2, 2024 at 1:37 PM Peter Smith <smithpb2...@gmail.com> wrote: > > Commit message. > > 1. > Because such synced slots are typically considered not > active (for them to be later considered as inactive) as they don't > perform logical decoding to produce the changes. > > This sentence is bad grammar. The docs have the same wording, so > please see my doc review comment #4 suggestion below. +1 > 2. > + <para> > + Invalidates replication slots that are inactive for longer than > + specified amount of time. If this value is specified without units, > + it is taken as seconds. A value of zero (which is default) disables > + the timeout mechanism. This parameter can only be set in > + the <filename>postgresql.conf</filename> file or on the server > + command line. > + </para> > + > > nit - This is OK as-is, but OTOH why not make the wording consistent > with the previous GUC description? (e.g. see my v43 [1] #2 review > comment) +1. > 3. > + <para> > + This invalidation check happens either when the slot is acquired > + for use or during checkpoint. The time since the slot has become > + inactive is known from its > + <structfield>inactive_since</structfield> value using which the > + timeout is measured. > + </para> > + > > I felt this is slightly misleading because slot acquiring has nothing > to do with setting the slot invalidation anymore. Furthermore, the 2nd > sentence is bad grammar. > > nit - IMO something simple like the following rewording can address > both of those points: > > Slot invalidation due to inactivity timeout occurs during checkpoint. > The duration of slot inactivity is calculated using the slot's > <structfield>inactive_since</structfield> field value. +1. > 4. > + Because such synced slots are typically considered not active > + (for them to be later considered as inactive) as they don't perform > + logical decoding to produce the changes. > > That sentence has bad grammar. > > nit – suggest a much simpler replacement: > Synced slots are always considered to be inactive because they don't > perform logical decoding to produce changes. +1. > 5. > +#define IsInactiveTimeoutSlotInvalidationApplicable(s) \ > > 5a. > I felt this would be better implemented as an inline function. Then it > can be commented on properly to explain the parts of the condition. > e.g. the large comment currently in InvalidatePossiblyObsoleteSlot() > would be more appropriate in this function. +1. > 5b. > The name is very long. Can't it be something shorter/simpler like: > 'IsSlotATimeoutCandidate()' > > ~~~ Missing inactive in the above suggested name. Used SlotInactiveTimeoutCheckAllowed, similar to XLogInsertAllowed. > 6. ReplicationSlotAcquire > > -ReplicationSlotAcquire(const char *name, bool nowait) > +ReplicationSlotAcquire(const char *name, bool nowait, > + bool check_for_invalidation) > > nit - Previously this new parameter really did mean to "check" for > [and set the slot] invalidation. But now I suggest renaming it to > 'error_if_invalid' to properly reflect the new usage. And also in the > slot.h. +1. > 7. > + /* > + * Error out if the slot has been invalidated previously. Because there's > + * no use in acquiring the invalidated slot. > + */ > > nit - The comment is contrary to the code. If there was no reason to > skip this error, then you would not have the new parameter allowing > you to skip this error. I suggest just repeating the same comment as > in the function header. +1. > 8. ReportSlotInvalidation > > nit - Added some blank lines for consistency. +1. > 9. InvalidatePossiblyObsoleteSlot > > 9a. > Consistency is good (commit message, docs and code comments for this), > but the added sentence has bad grammar. Please see the docs review > comment #4 above for some alternate phrasing. +1. > 9b. > Now that this logic is moved into a macro (I suggested it should be an > inline function) IMO this comment does not belong here anymore because > it is commenting code that you cannot see. Instead, this comment (or > something like it) should be as comments within the new function. > > ====== > src/include/replication/slot.h +1. > 10. > +extern void ReplicationSlotAcquire(const char *name, bool nowait, > + bool check_for_invalidation); > > Change the new param name as described in the earlier review comment. +1. > Please refer to the attached file which implements some of the nits > mentioned above. Merged the diff into v45. Thanks. On Tue, Sep 3, 2024 at 12:26 PM Peter Smith <smithpb2...@gmail.com> wrote: > > TEST CASE #1 > > 1. > +# Wait for the inactive replication slot to be invalidated. > > Is that comment correct? IIUC the synced slot should *already* be > invalidated from the primary, so here we are not really "waiting" for > it to be invalidated; Instead, we are just "confirming" that the > synchronized slot is already invalidated with the correct reason as > expected. Modified the comment. > 2. > +# Synced slot mustn't get invalidated on the standby even after a checkpoint, > +# it must sync invalidation from the primary. So, we must not see the slot's > +# invalidation message in server log. > > This test case seemed bogus, for a couple of reasons: > > 2a. IIUC this 'lsub1_sync_slot' is the same one that is already > invalid (from the primary), so nobody should be surprised that an > already invalid slot doesn't get flagged as invalid again. i.e. > Shouldn't your test scenario here be done using a valid synced slot? +1. Added another test case for checking the synced slot not getting invalidated despite inactive timeout being set. > 2b. AFAICT it was only moments above this CHECKPOINT where you > assigned the standby inactivity timeout to 2s. So even if there was > some bug invalidating synced slots I don't think you gave it enough > time to happen -- e.g. I doubt 2s has elapsed yet. Added sleep(timeout+1) before the checkpoint. > 3. > +# Stop standby to make the standby's replication slot on the primary inactive > +$standby1->stop; > + > +# Wait for the standby's replication slot to become inactive > > TEST CASE #2 > > 4. > +# Stop subscriber to make the replication slot on publisher inactive > +$subscriber->stop; > + > +# Wait for the replication slot to become inactive and then invalidated due > to > +# timeout. > +wait_for_slot_invalidation($publisher, 'lsub1_slot', $logstart, > + $inactive_timeout); > > IIUC, this is just like comment #3 above. Both these (the stop and the > wait) seem to belong together, so I think maybe a single bigger > explanatory comment covering both parts would help for understanding. Done. > 5. > +# Testcase end: Invalidate logical subscriber's slot due to > +# replication_slot_inactive_timeout. > +# > ============================================================================= > > IMO the rest of the comment after "Testcase end" isn't very useful. Removed. > ====== > sub wait_for_slot_invalidation > > 6. > +sub wait_for_slot_invalidation > +{ > > An explanatory header comment for this subroutine would be helpful. Done. > 7. > + # Wait for the replication slot to become inactive > + $node->poll_query_until( > > Why are there are 2 separate poll_query_until's here? Can't those be > combined into just one? Ah. My bad. Removed. > ~~~ > > 8. > + # Sleep at least $inactive_timeout duration to avoid multiple checkpoints > + # for the slot to get invalidated. > + sleep($inactive_timeout); > + > > Maybe this special sleep to prevent too many CHECKPOINTs should be > moved to be inside the other subroutine, which is actually doing those > CHECKPOINTs. Done. > 9. > + # Wait for the inactive replication slot to be invalidated > + "Timed out while waiting for inactive slot $slot_name to be > invalidated on node $name"; > + > > The comment seems misleading. IIUC you are not "waiting" for the > invalidation here, because it is the other subroutine doing the > waiting for the invalidation message in the logs. Instead, here I > think you are just confirming the 'invalidation_reason' got set > correctly. The comment should say what it is really doing. Modified. > sub check_for_slot_invalidation_in_server_log > > 10. > +# Check for invalidation of slot in server log > +sub check_for_slot_invalidation_in_server_log > +{ > > I think the main function of this subroutine is the CHECKPOINT and the > waiting for the server log to say invalidation happened. It is doing a > loop of a) CHECKPOINT then b) inspecting the server log for the slot > invalidation, and c) waiting for a bit. Repeat 10 times. > > A comment describing the logic for this subroutine would be helpful. > > The most important side-effect of this function is the CHECKPOINT > because without that nothing will ever get invalidated due to > inactivity, but this key point is not obvious from the subroutine > name. > > IMO it would be better to name this differently to reflect what it is > really doing: > e.g. "CHECKPOINT_and_wait_for_slot_invalidation_in_server_log" That would be too long. Changed the function name to trigger_slot_invalidation() which is appropriate. Please find the v45 patch. Addressed above and Shveta's review comments [1]. Amit's comments [2] and [3] are still pending. [1] https://www.postgresql.org/message-id/CAJpy0uC8Dg-0JS3NRUwVUemgz5Ar2v3_EQQFXyAigWSEQ8U47Q%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAA4eK1K7DdT_5HnOWs5tVPYC%3D-h%2Bm85wu7k-7RVJaJ7zMxprWQ%40mail.gmail.com [3] https://www.postgresql.org/message-id/CAA4eK1%2Bkt-QRr1RP%3DD%3D4_tp%2BS%2BCErQ6rNe7KVYEyZ3f6PYXpvw%40mail.gmail.com -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
v45-0001-Add-inactive_timeout-based-replication-slot-inva.patch
Description: Binary data