Re: pg_upgrade: optimize replication slot caught-up check

shveta malik Thu, 22 Jan 2026 23:03:15 -0800

On Fri, Jan 23, 2026 at 2:04 AM Masahiko Sawada <[email protected]> wrote:
>
> On Wed, Jan 21, 2026 at 7:49 PM shveta malik <[email protected]> wrote:
> >
> > On Thu, Jan 22, 2026 at 5:19 AM Masahiko Sawada <[email protected]> 
> > wrote:
> > >
> > > On Mon, Jan 19, 2026 at 10:38 PM shveta malik <[email protected]> 
> > > wrote:
> > > >
> > > > On Wed, Jan 14, 2026 at 11:24 PM Masahiko Sawada 
> > > > <[email protected]> wrote:
> > > > >
> > > > > I've attached the updated patch.
> > > > >
> > > >
> > > > Thank You for the patch. I like the idea of optimization. Few initial 
> > > > comments:
> > >
> > > Thank you for reviewing the patch!
> > >
> > > >
> > > > 1)
> > > > + * The query returns the slot names and their caught-up status in
> > > > + * the same order as the results collected by
> > > > + * get_old_cluster_logical_slot_infos(). If this query is changed,
> > > >
> > > > I could not find the function get_old_cluster_logical_slot_infos(), do
> > > > you mean get_old_cluster_logical_slot_infos_query()?
> > >
> > > It seems an oversight in commit 6d3d2e8e541f0. I think it should be
> > > get_db_rel_and_slot_infos().
> > >
> > > >
> > > > 2)
> > > > "  WHERE database = current_database() AND "
> > > > "    slot_type = 'logical' AND "
> > > >
> > > > Is there a reason why database = current_database() is placed before
> > > > slot_type = 'logical'? I am not sure how the PostgreSQL optimizer and
> > > > executor will order these predicates, but from the first look,
> > > > slot_type = 'logical' appears cheaper and could be placed first,
> > > > consistent with the ordering used at other places.
> > >
> > > Changed.
> > >
> > > >
> > > > 3)
> > > > Shouldn’t we add a sanity check inside
> > > > get_old_cluster_logical_slot_infos_query() to ensure that when
> > > > skip_caught_up_check is true, we are on PostgreSQL 18 or lower? This
> > > > would make the function safer for future use if it's called elsewhere.
> > > > I understand the caller already performs a similar check, but I think
> > > > it's more appropriate here since we call
> > > > binary_upgrade_logical_slot_has_caught_up() from inside, which doesn’t
> > > > even exist on newer versions.
> > >
> > > What kind of sanity check did you mean? We can have a check with
> > > pg_fatal() but it seems almost the same to me even if pg_upgrade fails
> > > with an error due to missing
> > > binary_upgrade_logical_slot_has_caught_up().
> >
> > I was referring to a development-level sanity check, something like:
> >
> > /* skip_caught_up_check is required iff PG19 or newer */
> > Assert((GET_MAJOR_VERSION(cluster->major_version) >= 1900) ==
> >    skip_caught_up_check);
> >
> > But performing this check requires access to the cluster version (or
> > cluster information), which this function currently does not have.
> > Given that, do you think it would make sense to pass the cluster as an
> > argument to this function in order to perform the sanity check here?
>
> Hmm, I think it's better not to have the same check in multiple
> places, but it might make sense to have
> get_old_cluster_logical_slot_infos_query() decide whether to use the
> fast method. I've updated the patch accordingly, please review it.
>


Okay, looks good. Just one minor thing:

+ * Note that binary_upgrade_logical_slot_has_caught_up() is available only
+ * PG18 or older. For PG19 or newer *use_fast_caught_up_check should be
+ * set true, and use binary_upgrade_check_logical_slot_pending_wal()
+ * instead in the separate query (see slot_caught_up_info_query).

Shall we tweak it slightly:

 * Note that binary_upgrade_logical_slot_has_caught_up() is available
 * only in PG18 and earlier. For PG19 and later, set *use_fast_caught_up_check
 * to true and use binary_upgrade_check_logical_slot_pending_wal() instead,
 * in a separate query (see slot_caught_up_info_query in
get_db_rel_and_slot_infos()).


> >
> > > >
> > > > 4)
> > > > +# Check the file content. While both test_slot1 and test_slot2 should
> > > > be reporting
> > > > +# that they have unconsumed WAL records, test_slot3 should not be 
> > > > reported as
> > > > +# it has caught up.
> > > >
> > > > Can you please elaborate the reason behind test_slot3 not being
> > > > reported? Also mention in the comment if possible.
> > >
> > > We advance test_slot3 to the current WAL LSN before executing
> > > pg_upgrade, so the test_slot3 should have consumed all pending WALs.
> > > Please refer to the following changes:
> >
> > I understand the test, and the comments are clear to me. I also
> > understand that only test_slot3 is expected to be in the caught-up
> > state. My questions were specifically about the following points:
> > 1)  Why do we expect 'slot3 caught-up' not to be mentioned in the LOG?
> > Is it simply because there is no corresponding logging in the code, or
> > is this behavior related to some aspect of your fix that I may have
> > missed?
> >
> > 2) In general, we do not validate the absence of LOG messages in
> > tests. Why is this considered a special case where such a check is
> > appropriate?
>
> What LOG do you refer to? In these tests, we check the
> invalid_logical_slots.txt file where pg_upgrade reports only invalid
> slots (in terms of pg_upgrade). For test_slot3, it should not be
> mentioned in that file as it has caught up. Given that the file has
> only invalid slots, checking the absence of test_slot3 in the file
> makes sense to me.
>

Okay, I get the intent now. Thanks!

Other than the comment suggestion above, the patch LGTM.

thanks
Shveta

Re: pg_upgrade: optimize replication slot caught-up check

Reply via email to