On Fri, Feb 7, 2025 at 8:00 AM Peter Smith <smithpb2...@gmail.com> wrote: > > Hi Nisha, > > Some review comments for v71-0001. > > ====== > src/backend/replication/slot.c > > SlotInvalidationCauses[] > > 2. > [RS_INVAL_WAL_REMOVED] = "wal_removed", > [RS_INVAL_HORIZON] = "rows_removed", > [RS_INVAL_WAL_LEVEL] = "wal_level_insufficient", > + [RS_INVAL_IDLE_TIMEOUT] = "idle_timeout", > }; > > By using bit flags in the enum (see slot.h) and designated > initializers here in SlotInvalidationCauses[], you'll end up with 9 > entries (0-0x08) instead of 4, and the other undesignated entries will > be all NULL. Maybe it is intended, but if it is I think it is strange > to be indexing by bit flags so at least you should have a comment. > > If you really need bitflags then perhaps it is better to maintain them > in addition to the v70 enum values (??) > > ~~~ > > 3. > /* Maximum number of invalidation causes */ > -#define RS_INVAL_MAX_CAUSES RS_INVAL_WAL_LEVEL > +#define RS_INVAL_MAX_CAUSES RS_INVAL_IDLE_TIMEOUT > > Hmm. The impact of using bit flags has (probably) unintended > consequences. e.g. Now you've made the GetSlotInvalidationCause() > function worse than before because now it will be iterating over all > the undesignated NULL entries of the array when searching for the > matching cause. >
Introduced a new struct, "SlotInvalidationCauseMap", to store invalidation cause enums and their corresponding cause_name strings. Replaced "SlotInvalidationCauses[]" with a map (array of structs), eliminating extra NULL spaces and reducing unnecessary iterations. With this change, a new function, GetSlotInvalidationCauseName(), was added to retrieve the cause_name string for a given cause_enum. > ~~~ > > 4. > + /* Calculate the idle time duration of the slot */ > + elapsed_secs = (now - inactive_since) / USECS_PER_SEC; > + minutes = elapsed_secs / SECS_PER_MINUTE; > + secs = elapsed_secs % SECS_PER_MINUTE; > + > + /* translator: %s is a GUC variable name */ > + appendStringInfo(&err_detail, _("The slot's idle time of %d minutes > and %d seconds exceeds the configured \"%s\" duration."), > + minutes, secs, "idle_replication_slot_timeout"); > > Idleness timeout durations defined like 1d aren't going to look pretty > using this log format. We already discussed off-list about how to make > this better, but not done yet? > There was an off-list suggestion to include the configured GUC value in the err_detail message for better clarity. This change makes it easier for users to compare, especially for large values like 1d. For example, if the timeout duration is set to 1d, the message will now appear as: " The slot's idle time of 1440 minutes and 54 seconds exceeds the configured "idle_replication_slot_timeout" duration of 1440 minutes." Thoughts? > ~~~ > > 6. > ReportSlotInvalidation(invalidation_cause, true, active_pid, > slotname, restart_lsn, > - oldestLSN, snapshotConflictHorizon); > + oldestLSN, snapshotConflictHorizon, > + inactive_since, now); > > if (MyBackendType == B_STARTUP) > (void) SendProcSignal(active_pid, > @@ -1785,7 +1881,8 @@ > InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, > > ReportSlotInvalidation(invalidation_cause, false, active_pid, > slotname, restart_lsn, > - oldestLSN, snapshotConflictHorizon); > + oldestLSN, snapshotConflictHorizon, > + inactive_since, now); > > If the cause was not already (masked with) RS_INVAL_IDLE_TIMEOUT then > AFAICT 'now' will still be 0 here. > > This seems an unexpected quirk, which at best is quite misleading. > Even if the code sty like this I felt ReportSlotInvalidation should > Assert 'now' must be 0 unless the cause passed was > RS_INVAL_IDLE_TIMEOUT. > 'now' will be non-zero even when RS_INVAL_IDLE_TIMEOUT is masked with other possible causes like RS_INVAL_WAL_REMOVED, and the slot gets invalidated first due to RS_INVAL_WAL_REMOVED. Therefore, 'now' being non-zero is not exclusive to RS_INVAL_IDLE_TIMEOUT. However, since it must be non-zero when the cause in ReportSlotInvalidation() is RS_INVAL_IDLE_TIMEOUT, I've added an Assert for the same. > ~~~ > ====== > src/include/replication/slot.h > > 8. > - * When adding a new invalidation cause here, remember to update > + * When adding a new invalidation cause here, the value must be powers of 2 > + * (e.g., 1, 2, 4...) for proper bitwise operations. Also, remember to update > * SlotInvalidationCauses and RS_INVAL_MAX_CAUSES. > */ > typedef enum ReplicationSlotInvalidationCause > { > - RS_INVAL_NONE, > + RS_INVAL_NONE = 0x00, > /* required WAL has been removed */ > - RS_INVAL_WAL_REMOVED, > + RS_INVAL_WAL_REMOVED = 0x01, > /* required rows have been removed */ > - RS_INVAL_HORIZON, > + RS_INVAL_HORIZON = 0x02, > /* wal_level insufficient for slot */ > - RS_INVAL_WAL_LEVEL, > + RS_INVAL_WAL_LEVEL = 0x04, > + /* idle slot timeout has occurred */ > + RS_INVAL_IDLE_TIMEOUT = 0x08, > } ReplicationSlotInvalidationCause; > > 8a. > IMO enums are intended for discrete values like "red" or "blue", but > not combinations of values like "reddy-bluey". AFAIK this kind of > usage is not normal and is discouraged in C programming. > > So if you need bitflags then really the bit flags should be #define etc. > I feel using the ReplicationSlotInvalidationCause type instead of "int" (in case we use macros) improves code readability and maintainability. OTOH, keeping the enums as they are in v70, and defining new macros for the very similar purpose could add unnecessary complexity to code management. > ~ > > 8b. > Does it make sense? You can't invalidate something that is already > invalid, so what does it even mean to have multiple simultaneous > ReplicationSlotInvalidationCause values? AFAICT it was only done like > this to CHECK for multiple **possible** causes, but this point is not > very clear > Added comments at the top of InvalidateObsoleteReplicationSlots() to clarify that it tries to invalidate slots for multiple possible causes in a single pass, as explained in [1]. > ~ > > 8c. > This introduces a side-effect that now the char *const > SlotInvalidationCauses[] array in slot.c will have 8 entries, half of > them NULL. Already mentioned elsewhere. And, this will get > increasingly worse if more invalidation reasons get added. 8,16,32,64 > mostly unused entries etc... > This issue is now resolved by replacing SlotInvalidationCauses[] with a new array of structures. > ====== Attached v72 patches, addressed the above comments as well as Vignesh's comments in [2]. - There are no new changes in patch-002. [1] https://www.postgresql.org/message-id/CAA4eK1Jatoapf2NoX2nJOJ8k-RvEZM%3DMoFUvNWPz4rRR1simQw%40mail.gmail.com [2] https://www.postgresql.org/message-id/CALDaNm3wx8ihfkidveKuK%3DgGujS_yc9sEgq6ev-T%2BW3zeHM88g%40mail.gmail.com -- Thanks, Nisha
v72-0001-Introduce-inactive_timeout-based-replication-slo.patch
Description: Binary data
v72-0002-Add-TAP-test-for-slot-invalidation-based-on-inac.patch
Description: Binary data