009_twophase.pl

Tom Lane Sun, 02 Jul 2017 13:42:01 -0700

I wrote:
> The reporting critters are all on the slow side, so I suspected
> a timing problem, especially since it only started to show up
> after changing pg_ctl's timing behavior.  I can't reproduce it
> locally on unmodified sources, but I could after putting my thumb
> on the scales like this:
> 
> diff --git a/src/backend/replication/walreceiver.c 
> b/src/backend/replication/walreceiver.c
> index ea9d21a..7ef22c1 100644
> *** a/src/backend/replication/walreceiver.c
> --- b/src/backend/replication/walreceiver.c
> *************** WalReceiverMain(void)
> *** 446,451 ****
> --- 446,454 ----
>                                                        * Something was 
> received from master, so reset
>                                                        * timeout
>                                                        */
> + 
> +                                                     pg_usleep(100000); /* 
> be slow about processing */
> + 
>                                                       last_recv_timestamp = 
> GetCurrentTimestamp();
>                                                       ping_sent = false;
>                                                       
> XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);



BTW, I neglected to mention that this hack also produces a second failure
in the recovery test suite:

[20:28:31] t/001_stream_rep.pl .................. 11/28 
#   Failed test 'xmin of cascaded slot null with no hs_feedback'
#   at t/001_stream_rep.pl line 178.
#          got: undef
#     expected: ''

#   Failed test 'catalog xmin of cascaded slot null with no hs_feedback'
#   at t/001_stream_rep.pl line 179.
#          got: undef
#     expected: ''
[20:28:31] t/001_stream_rep.pl .................. 28/28 # Looks like you failed 
2 tests of 28.
[20:28:31] t/001_stream_rep.pl .................. Dubious, test returned 2 
(wstat 512, 0x200)
Failed 2/28 subtests 

I had supposed last night that this was merely a matter of needing
to insert a wait_slot_xmins() call before those tests, but doing so
doesn't help.  And, looking closer, the report is complaining that
$xmin and $catalog_xmin are coming back as undef, not empty strings.
My Perl-fu is too weak to figure out how that could possibly be ---
the documentation for PostgresNode::query_hash certainly states
flatly that it will never happen.  And it even more certainly
shouldn't happen if we already successfully probed the
pg_replication_slots row in wait_slot_xmins().

The corresponding entries in the postmaster's log don't look abnormal:

2017-07-02 20:07:29.313 UTC [30862] LOG:  database system is ready to accept 
read only connections
2017-07-02 20:07:29.318 UTC [30867] LOG:  started streaming WAL from primary at 
0/3000000 on timeline 1
2017-07-02 20:07:29.415 UTC [30870] t/001_stream_rep.pl LOG:  statement: SELECT 
pg_create_physical_replication_slot('standby_2');
2017-07-02 20:07:29.619 UTC [30863] LOG:  redo starts at 0/3029458
2017-07-02 20:07:29.665 UTC [30882] t/001_stream_rep.pl LOG:  statement: 
                        SELECT xmin IS NULL AND catalog_xmin IS NULL
                        FROM pg_catalog.pg_replication_slots
                        WHERE slot_name = 'standby_2';
                
2017-07-02 20:07:29.679 UTC [30884] t/001_stream_rep.pl LOG:  statement: SELECT 
plugin, slot_type, datoid, database, active, active_pid, xmin, catalog_xmin, 
restart_lsn FROM pg_catalog.pg_replication_slots WHERE slot_name = 'standby_2'

(This is from a run with a wait_slot_xmins() call added, else the
statement logged by PID 30882 would not have been issued.)

Any ideas what's wrong there?

                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl

Reply via email to