I wrote: > The reporting critters are all on the slow side, so I suspected > a timing problem, especially since it only started to show up > after changing pg_ctl's timing behavior. I can't reproduce it > locally on unmodified sources, but I could after putting my thumb > on the scales like this: > > diff --git a/src/backend/replication/walreceiver.c > b/src/backend/replication/walreceiver.c > index ea9d21a..7ef22c1 100644 > *** a/src/backend/replication/walreceiver.c > --- b/src/backend/replication/walreceiver.c > *************** WalReceiverMain(void) > *** 446,451 **** > --- 446,454 ---- > * Something was > received from master, so reset > * timeout > */ > + > + pg_usleep(100000); /* > be slow about processing */ > + > last_recv_timestamp = > GetCurrentTimestamp(); > ping_sent = false; > > XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
BTW, I neglected to mention that this hack also produces a second failure in the recovery test suite: [20:28:31] t/001_stream_rep.pl .................. 11/28 # Failed test 'xmin of cascaded slot null with no hs_feedback' # at t/001_stream_rep.pl line 178. # got: undef # expected: '' # Failed test 'catalog xmin of cascaded slot null with no hs_feedback' # at t/001_stream_rep.pl line 179. # got: undef # expected: '' [20:28:31] t/001_stream_rep.pl .................. 28/28 # Looks like you failed 2 tests of 28. [20:28:31] t/001_stream_rep.pl .................. Dubious, test returned 2 (wstat 512, 0x200) Failed 2/28 subtests I had supposed last night that this was merely a matter of needing to insert a wait_slot_xmins() call before those tests, but doing so doesn't help. And, looking closer, the report is complaining that $xmin and $catalog_xmin are coming back as undef, not empty strings. My Perl-fu is too weak to figure out how that could possibly be --- the documentation for PostgresNode::query_hash certainly states flatly that it will never happen. And it even more certainly shouldn't happen if we already successfully probed the pg_replication_slots row in wait_slot_xmins(). The corresponding entries in the postmaster's log don't look abnormal: 2017-07-02 20:07:29.313 UTC [30862] LOG: database system is ready to accept read only connections 2017-07-02 20:07:29.318 UTC [30867] LOG: started streaming WAL from primary at 0/3000000 on timeline 1 2017-07-02 20:07:29.415 UTC [30870] t/001_stream_rep.pl LOG: statement: SELECT pg_create_physical_replication_slot('standby_2'); 2017-07-02 20:07:29.619 UTC [30863] LOG: redo starts at 0/3029458 2017-07-02 20:07:29.665 UTC [30882] t/001_stream_rep.pl LOG: statement: SELECT xmin IS NULL AND catalog_xmin IS NULL FROM pg_catalog.pg_replication_slots WHERE slot_name = 'standby_2'; 2017-07-02 20:07:29.679 UTC [30884] t/001_stream_rep.pl LOG: statement: SELECT plugin, slot_type, datoid, database, active, active_pid, xmin, catalog_xmin, restart_lsn FROM pg_catalog.pg_replication_slots WHERE slot_name = 'standby_2' (This is from a run with a wait_slot_xmins() call added, else the statement logged by PID 30882 would not have been issued.) Any ideas what's wrong there? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers