Hi,

On Wed, Mar 19, 2025 at 12:12:19PM +0530, Amit Kapila wrote:
> On Mon, Feb 10, 2025 at 8:12 PM Bertrand Drouvot
> <bertranddrouvot...@gmail.com> wrote:
> >
> > Please find attached a patch to $SUBJECT.
> >
> > In rare circumstances (and on slow machines) it is possible that a 
> > xl_running_xacts
> > is emitted and that the catalog_xmin of a logical slot on the standby 
> > advances
> > past the conflict point. In that case, no conflict is reported and the test
> > fails. It has been observed several times and the last discussion can be 
> > found
> > in [1].
> >
> 

Thanks for looking at it!

> Is my understanding correct that bgwriter on primary node has created
> a  xl_running_xacts, then that record is replicated to standby, and
> while decoding it (xl_running_xacts) on standby via active_slot, we
> advanced the catalog_xmin of active_slot? If this happens then the
> replay of vacuum record on standby won't be able to invalidate the
> active slot, right?

Yes, that's also my understanding. It's also easy to "simulate" by adding
a checkpoint on the primary and a long enough sleep after we launched our sql 
in 
wait_until_vacuum_can_remove().

> So, if the above is correct, the reason for generating extra
> xl_running_xacts on primary is Vacuum followed by Insert on primary
> via below part of test:
> $node_primary->safe_psql(
> 'testdb', qq[VACUUM $vac_option verbose $to_vac;
>   INSERT INTO flush_wal DEFAULT VALUES;]);

I'm not sure, I think a xl_running_xacts could also be generated (for example by
the checkpointer) before the vacuum (should the system be slow enough).

> > Remarks:
> >
> > R1. The issue still remains in v16 though (as injection points are 
> > available since
> > v17).
> >
> 
> This is not idle case because the test would still keep failing
> intermittently on 16.

I do agree.

> I am wondering what if we start a transaction
> before vacuum and do some DML in it but didn't commit that xact till
> the active_slot test is finished then even the extra logging of
> xl_running_xacts shouldn't advance xmin during decoding.

I'm not sure, as I think a xl_running_xacts could still be generated after
we execute "our sql" meaning:

"
    $node_primary->safe_psql('testdb', qq[$sql]);
"

and before we launch the new DML. In that case I guess the issue could still
happen.

OTOH If we create the new DML "before" we launch "our sql" then the test
would also fail for both active and inactive slots because that would not
invalidate any slots.

I did observe the above with the attached changes (just changing the PREPARE
TRANSACTION location).

> we should try to find some solution which could be
> backpatched to 16 as well.

I agree, but I'm not sure it's doable as it looks to me that we should prevent
the catalog xmin to advance to advance past the conflict point while still
generating a conflict point. Will try to give it another thought.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl 
b/src/test/recovery/t/035_standby_logical_decoding.pl
index c31cab06f1c..edd8009fbce 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -258,10 +258,20 @@ sub wait_until_vacuum_can_remove
        # Launch our sql.
        $node_primary->safe_psql('testdb', qq[$sql]);
 
+       $node_primary->safe_psql('testdb',"CHECKPOINT");
+       sleep(20);
+
+       $node_primary->safe_psql(
+           'testdb', "
+               BEGIN;
+               PREPARE TRANSACTION 'prevent_slot_advance_v1';
+               INSERT INTO prevent_slot_advance VALUES (1);"
+       );
+
        # Wait until we get a newer horizon.
-       $node_primary->poll_query_until('testdb',
-               "SELECT (select 
pg_snapshot_xmin(pg_current_snapshot())::text::int - $xid_horizon) > 0"
-       ) or die "new snapshot does not have a newer horizon";
+       #$node_primary->poll_query_until('testdb',
+       #       "SELECT (select 
pg_snapshot_xmin(pg_current_snapshot())::text::int - $xid_horizon) > 0"
+       #) or die "new snapshot does not have a newer horizon";
 
        # Launch the vacuum command and insert into flush_wal (see CREATE
        # TABLE flush_wal for the reason).
@@ -281,7 +291,9 @@ wal_level = 'logical'
 max_replication_slots = 4
 max_wal_senders = 4
 autovacuum = off
+max_prepared_transactions = 10
 });
+
 $node_primary->dump_info;
 $node_primary->start;
 
@@ -305,6 +317,7 @@ $node_primary->backup($backup_name);
 # Some tests need to wait for VACUUM to be replayed. But vacuum does not flush
 # WAL. An insert into flush_wal outside transaction does guarantee a flush.
 $node_primary->psql('testdb', q[CREATE TABLE flush_wal();]);
+$node_primary->psql('testdb', q[CREATE TABLE prevent_slot_advance(a int);]);
 
 #######################
 # Initialize standby node
@@ -565,6 +578,8 @@ check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL 
on pg_class');
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 
+$node_primary->safe_psql('testdb', "COMMIT PREPARED 
'prevent_slot_advance_v1'");
+
 # Attempting to alter an invalidated slot should result in an error
 ($result, $stdout, $stderr) = $node_standby->psql(
        'postgres',
@@ -664,6 +679,8 @@ check_for_invalidation('row_removal_', $logstart, 'with 
vacuum on pg_class');
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('row_removal_', 'rows_removed');
 
+$node_primary->safe_psql('testdb', "COMMIT PREPARED 
'prevent_slot_advance_v1'");
+
 $handle =
   make_slot_active($node_standby, 'row_removal_', 0, \$stdout, \$stderr);
 
@@ -699,6 +716,8 @@ check_for_invalidation('shared_row_removal_', $logstart,
 # Verify reason for conflict is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('shared_row_removal_', 'rows_removed');
 
+$node_primary->safe_psql('testdb', "COMMIT PREPARED 
'prevent_slot_advance_v1'");
+
 $handle = make_slot_active($node_standby, 'shared_row_removal_', 0, \$stdout,
        \$stderr);
 
@@ -737,6 +756,8 @@ ok( !$node_standby->log_contains(
        'activeslot slot invalidation is not logged with vacuum on 
conflict_test'
 );
 
+$node_primary->safe_psql('testdb', "COMMIT PREPARED 
'prevent_slot_advance_v1'");
+
 # Verify that pg_stat_database_conflicts.confl_active_logicalslot has not been 
updated
 ok( $node_standby->poll_query_until(
                'postgres',

Reply via email to