Hi,

Thanks for the new version of the patch. Btw, could you add Craig as a
co-author in the commit message of the next version of the patch? Don't
want to forget him.
On 2019-04-05 17:08:39 +0530, Amit Khandekar wrote:
> Regarding the test result failures, I could see that when we drop a
> logical replication slot at standby server, then the catalog_xmin of
> physical replication slot becomes NULL, whereas the test expects it to
> be equal to xmin; and that's the reason a couple of test scenarios are
> failing :
>
> ok 33 - slot on standby dropped manually
> Waiting for replication conn replica's replay_lsn to pass '0/31273E0' on master
> done
> not ok 34 - physical catalog_xmin still non-null
> not ok 35 - xmin and catalog_xmin equal after slot drop
> #   Failed test 'xmin and catalog_xmin equal after slot drop'
> #   at t/016_logical_decoding_on_replica.pl line 272.
> #          got:
> #     expected: 2584
>
> I am not sure what is expected. What actually happens is : the
> physical slot catalog_xmin remains NULL initially, but becomes
> non-NULL after the logical replication slot is created on standby.

That seems like the correct behaviour to me - why would we still have a
catalog xmin if there's no logical slot?

> diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> index 006446b..5785d2f 100644
> --- a/src/backend/replication/slot.c
> +++ b/src/backend/replication/slot.c
> @@ -1064,6 +1064,85 @@ ReplicationSlotReserveWal(void)
>      }
>  }
>
> +void
> +ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
> +{
> +    int     i;
> +    bool    found_conflict = false;
> +
> +    if (max_replication_slots <= 0)
> +        return;
> +
> +restart:
> +    if (found_conflict)
> +    {
> +        CHECK_FOR_INTERRUPTS();
> +        /*
> +         * Wait awhile for them to die so that we avoid flooding an
> +         * unresponsive backend when system is heavily loaded.
> +         */
> +        pg_usleep(100000);
> +        found_conflict = false;
> +    }
> +
> +    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> +    for (i = 0; i < max_replication_slots; i++)
> +    {
> +        ReplicationSlot *s;
> +        NameData    slotname;
> +        TransactionId slot_xmin;
> +        TransactionId slot_catalog_xmin;
> +
> +        s = &ReplicationSlotCtl->replication_slots[i];
> +
> +        /* cannot change while ReplicationSlotCtlLock is held */
> +        if (!s->in_use)
> +            continue;
> +
> +        /* not our database, skip */
> +        if (s->data.database != InvalidOid && s->data.database != dboid)
> +            continue;
> +
> +        SpinLockAcquire(&s->mutex);
> +        slotname = s->data.name;
> +        slot_xmin = s->data.xmin;
> +        slot_catalog_xmin = s->data.catalog_xmin;
> +        SpinLockRelease(&s->mutex);
> +
> +        if (TransactionIdIsValid(slot_xmin) &&
> +            TransactionIdPrecedesOrEquals(slot_xmin, xid))
> +        {
> +            found_conflict = true;
> +
> +            ereport(WARNING,
> +                    (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
> +                            NameStr(slotname), slot_xmin, xid)));
> +        }
> +
> +        if (TransactionIdIsValid(slot_catalog_xmin) &&
> +            TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
> +        {
> +            found_conflict = true;
> +
> +            ereport(WARNING,
> +                    (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
> +                            NameStr(slotname), slot_catalog_xmin, xid)));
> +        }
> +
> +
> +        if (found_conflict)
> +        {
> +            elog(WARNING, "Dropping conflicting slot %s", s->data.name.data);
> +            LWLockRelease(ReplicationSlotControlLock);  /* avoid deadlock */
> +            ReplicationSlotDropPtr(s);
> +
> +            /* We released the lock above; so re-scan the slots. */
> +            goto restart;
> +        }
> +    }

I think this should be refactored so that the two found_conflict cases
set a 'reason' variable (perhaps an enum?) to the particular reason, and
then only one warning should be emitted.
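
To illustrate, here's a rough sketch of how that part of the loop could
look after such a refactoring. The enum, variable names and message
wording below are made up for illustration only, they're not from the
patch:

    typedef enum SlotConflictReason
    {
        SLOT_CONFLICT_NONE,
        SLOT_CONFLICT_XMIN,
        SLOT_CONFLICT_CATALOG_XMIN
    } SlotConflictReason;

    ...

        SlotConflictReason reason = SLOT_CONFLICT_NONE;
        TransactionId conflict_xid = InvalidTransactionId;

        /* remember which horizon conflicts, report it only once below */
        if (TransactionIdIsValid(slot_xmin) &&
            TransactionIdPrecedesOrEquals(slot_xmin, xid))
        {
            reason = SLOT_CONFLICT_XMIN;
            conflict_xid = slot_xmin;
        }
        else if (TransactionIdIsValid(slot_catalog_xmin) &&
                 TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
        {
            reason = SLOT_CONFLICT_CATALOG_XMIN;
            conflict_xid = slot_catalog_xmin;
        }

        if (reason != SLOT_CONFLICT_NONE)
        {
            found_conflict = true;

            /* single message covering whichever horizon conflicted */
            ereport(WARNING,
                    (errmsg("dropping slot \"%s\" because its %s %u conflicts with removed xid %u",
                            NameStr(slotname),
                            reason == SLOT_CONFLICT_XMIN ? "xmin" : "catalog_xmin",
                            conflict_xid, xid)));

            LWLockRelease(ReplicationSlotControlLock);  /* avoid deadlock */
            ReplicationSlotDropPtr(s);
            goto restart;
        }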
I also think that LOG might be more appropriate than WARNING - as
confusing as that is, LOG is more severe than WARNING (see the docs
about log_min_messages).

> @@ -0,0 +1,386 @@
> +# Demonstrate that logical decoding can follow timeline switches.
> +#
> +# Test logical decoding on a standby.
> +#
> +use strict;
> +use warnings;
> +use 5.8.0;
> +
> +use PostgresNode;
> +use TestLib;
> +use Test::More tests => 55;
> +use RecursiveCopy;
> +use File::Copy;
> +
> +my ($stdin, $stdout, $stderr, $ret, $handle, $return);
> +my $backup_name;
> +
> +# Initialize master node
> +my $node_master = get_new_node('master');
> +$node_master->init(allows_streaming => 1, has_archiving => 1);
> +$node_master->append_conf('postgresql.conf', q{
> +wal_level = 'logical'
> +max_replication_slots = 4
> +max_wal_senders = 4
> +log_min_messages = 'debug2'
> +log_error_verbosity = verbose
> +# send status rapidly so we promptly advance xmin on master
> +wal_receiver_status_interval = 1
> +# very promptly terminate conflicting backends
> +max_standby_streaming_delay = '2s'
> +});
> +$node_master->dump_info;
> +$node_master->start;
> +
> +$node_master->psql('postgres', q[CREATE DATABASE testdb]);
> +
> +$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
> +$backup_name = 'b1';
> +my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
> +TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--slot=decoding_standby');
> +
> +sub print_phys_xmin
> +{
> +    my $slot = $node_master->slot('decoding_standby');
> +    return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
> +}
> +
> +my ($xmin, $catalog_xmin) = print_phys_xmin();
> +# After slot creation, xmins must be null
> +is($xmin, '', "xmin null");
> +is($catalog_xmin, '', "catalog_xmin null");
> +
> +my $node_replica = get_new_node('replica');
> +$node_replica->init_from_backup(
> +    $node_master, $backup_name,
> +    has_streaming => 1,
> +    has_restoring => 1);
> +$node_replica->append_conf('postgresql.conf',
> +    q[primary_slot_name = 'decoding_standby']);
> +
> +$node_replica->start;
> +$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
> +
> +# with hot_standby_feedback off, xmin and catalog_xmin must still be null
> +($xmin, $catalog_xmin) = print_phys_xmin();
> +is($xmin, '', "xmin null after replica join");
> +is($catalog_xmin, '', "catalog_xmin null after replica join");
> +
> +$node_replica->append_conf('postgresql.conf',q[
> +hot_standby_feedback = on
> +]);
> +$node_replica->restart;
> +sleep(2); # ensure walreceiver feedback sent

Can we make this more robust? E.g. by waiting till pg_stat_replication
shows the change on the primary (see the sketch at the end of this
mail)? Because I can guarantee that this'll fail on slow buildfarm
machines (say the valgrind animals).

> +$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
> +sleep(2); # ensure walreceiver feedback sent

Similar.

Greetings,

Andres Freund
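
PS: What I mean by polling for the feedback, roughly. This is an
untested sketch, not from the patch; wait_for_feedback_xmin is just an
illustrative name, poll_query_until() is the existing PostgresNode
helper:

    # Poll the primary until hot_standby_feedback has actually reported
    # an xmin, instead of sleeping for a fixed two seconds.
    sub wait_for_feedback_xmin
    {
        my $applname = $node_replica->name;
        $node_master->poll_query_until('postgres',
            qq[SELECT backend_xmin IS NOT NULL
               FROM pg_stat_replication
               WHERE application_name = '$applname';])
          or die "timed out waiting for hot_standby_feedback xmin on master";
    }

    # instead of: sleep(2);    # ensure walreceiver feedback sent
    wait_for_feedback_xmin();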