I wrote:
> Probably this requires a relcache inval at the wrong time;
> although we have recent passes from CLOBBER_CACHE_ALWAYS animals,
> so that can't be the whole triggering condition. I wonder whether
> it is relevant that all of the complaining animals are JIT-enabled.
Hmmm ... I take that back. hyrax has indeed passed since this went
in, but *it doesn't run any TAP tests*. So the buildfarm offers no
information about whether the replication tests work under
CLOBBER_CACHE_ALWAYS.
Realizing that, I built an installation that way and tried to run
the subscription tests. Results so far:
* Running 010_truncate.pl by itself passed for me. So there's still
some unexplained factor needed to trigger the buildfarm failures.
(I'm wondering about concurrent autovacuum activity now...)
* Starting over, it appears that 001_rep_changes.pl almost immediately
gets into an infinite loop. It does not complete the third test step,
rather infinitely waiting for progress to be made. The publisher log
shows a repeating loop like
2020-09-13 21:16:05.734 EDT [928529] tap_sub LOG: could not send data to
client: Broken pipe
2020-09-13 21:16:05.734 EDT [928529] tap_sub CONTEXT: slot "tap_sub", output
plugin "pgoutput", in the commit callback, associated LSN 0/1660628
2020-09-13 21:16:05.843 EDT [928581] 001_rep_changes.pl LOG: statement: SELECT
pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM
pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:16:05.861 EDT [928582] tap_sub LOG: statement: SELECT
pg_catalog.set_config('search_path', '', false);
2020-09-13 21:16:05.929 EDT [928582] tap_sub LOG: received replication
command: IDENTIFY_SYSTEM
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG: received replication
command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2',
publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG: starting logical decoding
for slot "tap_sub"
2020-09-13 21:16:05.930 EDT [928582] tap_sub DETAIL: Streaming transactions
committing after 0/1652820, reading WAL from 0/1651B20.
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG: logical decoding found
consistent point at 0/1651B20
2020-09-13 21:16:05.930 EDT [928582] tap_sub DETAIL: There are no running
transactions.
2020-09-13 21:16:21.560 EDT [928600] 001_rep_changes.pl LOG: statement: SELECT
pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM
pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:16:37.291 EDT [928610] 001_rep_changes.pl LOG: statement: SELECT
pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM
pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:16:52.959 EDT [928627] 001_rep_changes.pl LOG: statement: SELECT
pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM
pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:17:06.866 EDT [928636] tap_sub LOG: statement: SELECT
pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG: received replication
command: IDENTIFY_SYSTEM
2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG: received replication
command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2',
publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:06.934 EDT [928636] tap_sub ERROR: replication slot "tap_sub"
is active for PID 928582
2020-09-13 21:17:07.811 EDT [928638] tap_sub LOG: statement: SELECT
pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:07.880 EDT [928638] tap_sub LOG: received replication
command: IDENTIFY_SYSTEM
2020-09-13 21:17:07.881 EDT [928638] tap_sub LOG: received replication
command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2',
publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:07.881 EDT [928638] tap_sub ERROR: replication slot "tap_sub"
is active for PID 928582
2020-09-13 21:17:08.618 EDT [928641] 001_rep_changes.pl LOG: statement: SELECT
pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM
pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:17:08.753 EDT [928642] tap_sub LOG: statement: SELECT
pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG: received replication
command: IDENTIFY_SYSTEM
2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG: received replication
command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2',
publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:08.821 EDT [928642] tap_sub ERROR: replication slot "tap_sub"
is active for PID 928582
2020-09-13 21:17:09.689 EDT [928645] tap_sub LOG: statement: SELECT
pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:09.756 EDT [928645] tap_sub LOG: received replication
command: IDENTIFY_SYSTEM
2020-09-13 21:17:09.757 EDT [928645] tap_sub LOG: received replication
command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/1652820 (proto_version '2',
publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:09.757 EDT [928645] tap_sub ERROR: replication slot "tap_sub"
is active for PID 928582
2020-09-13 21:17:09.841 EDT [928582] tap_sub LOG: could not send data to
client: Broken pipe
2020-09-13 21:17:09.841 EDT [928582] tap_sub CONTEXT: slot "tap_sub", output
plugin "pgoutput", in the commit callback, associated LSN 0/1660628
while the subscriber is repeating
2020-09-13 21:15:01.598 EDT [928528] LOG: logical replication apply worker for
subscription "tap_sub" has started
2020-09-13 21:16:02.178 EDT [928528] ERROR: terminating logical replication
worker due to timeout
2020-09-13 21:16:02.179 EDT [920797] LOG: background worker "logical
replication worker" (PID 928528) exited with exit code 1
2020-09-13 21:16:02.606 EDT [928571] LOG: logical replication apply worker for
subscription "tap_sub" has started
2020-09-13 21:16:03.117 EDT [928571] ERROR: could not start WAL streaming:
ERROR: replication slot "tap_sub" is active for PID 928529
2020-09-13 21:16:03.118 EDT [920797] LOG: background worker "logical
replication worker" (PID 928571) exited with exit code 1
2020-09-13 21:16:03.544 EDT [928574] LOG: logical replication apply worker for
subscription "tap_sub" has started
2020-09-13 21:16:04.053 EDT [928574] ERROR: could not start WAL streaming:
ERROR: replication slot "tap_sub" is active for PID 928529
2020-09-13 21:16:04.054 EDT [920797] LOG: background worker "logical
replication worker" (PID 928574) exited with exit code 1
2020-09-13 21:16:04.479 EDT [928576] LOG: logical replication apply worker for
subscription "tap_sub" has started
2020-09-13 21:16:04.990 EDT [928576] ERROR: could not start WAL streaming:
ERROR: replication slot "tap_sub" is active for PID 928529
2020-09-13 21:16:04.990 EDT [920797] LOG: background worker "logical
replication worker" (PID 928576) exited with exit code 1
2020-09-13 21:16:05.415 EDT [928579] LOG: logical replication apply worker for
subscription "tap_sub" has started
2020-09-13 21:17:05.994 EDT [928579] ERROR: terminating logical replication
worker due to timeout
I'm out of patience to investigate this for tonight, but there is
something extremely broken here; maybe more than one something.
regards, tom lane