On Mon, Jan 29, 2018 at 10:09:49AM +1300, Thomas Munro wrote: > https://travis-ci.org/postgresql-cfbot/postgresql/builds/334334417 > > # Failed test 'pg_recvlogical acknowledged changes, nothing pending on slot' > # at t/006_logical_decoding.pl line 91. > # got: 'BEGIN > # table public.decoding_test: INSERT: x[integer]:1 y[text]:'1' > # table public.decoding_test: INSERT: x[integer]:2 y[text]:'2' > # table public.decoding_test: INSERT: x[integer]:3 y[text]:'3' > # table public.decoding_test: INSERT: x[integer]:4 y[text]:'4' > # COMMIT' > # expected: '' > # Looks like you failed 1 test of 16.
The problem is this StreamLogicalLog() code: if (PQresultStatus(res) == PGRES_COPY_OUT) { /* * We're doing a client-initiated clean exit and have sent CopyDone to * the server. We've already sent replay confirmation and fsync'd so * we can just clean up the connection now. */ goto error; } Once pg_recvlogical receives the XLogData containing the sought-after end position, that code makes pg_recvlogical exit without draining the remainder of the backend messages. If pg_recvlogical exits quickly enough, the backend send()s libpq messages after pg_recvlogical disconnects, which can cause internal_flush() to fail with EPIPE ("LOG: could not send data to client: Broken pipe"). If that precedes LogicalConfirmReceivedLocation(), the test fails as you experienced. Such failure happened once on the buildfarm[1]; post-LogicalConfirmReceivedLocation() EPIPE also happens[2]. The first attached patch causes this failure to happen almost every run. The fix (second attachment) is to call PQgetCopyData() until no records remain, then issue PQresultStatus() again[3]. This closes an additional defect, described in the log message. I looked at the other instances of "Broken pipe" in a couple of check-world runs. Clients might prevent those with cleaner shutdown on error, but it's cosmetic. They appeared in cases where the client or the server had already recognized some other failure, whereas $SUBJECT taints a successful run. This led me to notice other pg_recvlogical bugs, which I left unchanged: 1) An "unexpected termination of replication stream" error doesn't preclude exit(0). 2) sigint_handler() doesn't trigger a PQputCopyEnd(). The connection status remains PGRES_COPY_BOTH, prompting this weird message: $ pg_recvlogical --create-slot --start -S foo -d postgres -f- && echo success ^Cpg_recvlogical: error: unexpected termination of replication stream: success Other aspects of signal handling surprised me, but they may not be bugs. The documentation says that --start continues "until terminated by a signal". We don't trap SIGTERM, just SIGINT (initiate clean exit) and SIGHUP (reopen output file). pg_recvlogical copied SIGINT behavior from pg_receivewal, and the pg_receivewal documentation is specific about signals. Both programs react to SIGINT with exit(0), whether or not they reached --endpos. sigint_handler() doesn't trigger a flushAndSendFeedback(), so the slot's next pg_recvlogical will repeat messages that followed the last fsync/feedback. 3) sendFeedback() wants s/last_fsync_lsn !=/last_fsync_lsn ==/. This just changes the volume of feedback messages. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=frogfish&dt=2020-03-07%2018%3A49%3A34 [2] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=florican&dt=2020-04-27%2017%3A25%3A08&stg=recovery-check [3] https://www.postgresql.org/docs/devel/libpq-copy.html writes "After PQgetCopyData returns -1, call PQgetResult to obtain the final result status of the COPY command. One can wait for this result to be available in the usual way. Then return to normal operation."
Author: Noah Misch <n...@leadboat.com> Commit: Noah Misch <n...@leadboat.com> (not for commit) Reproduce failure seen here: https://postgr.es/m/CAEepm=1MzM2Z_xNe4foGwZ1a+MO_2S9oYDq3M5D11=jdu_+...@mail.gmail.com https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=frogfish&dt=2020-03-07%2018%3A49%3A34 diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 7717bb2..2268009 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -1448,7 +1448,8 @@ internal_flush(void) last_reported_send_errno = errno; ereport(COMMERROR, (errcode_for_socket_access(), - errmsg("could not send data to client: %m"))); + errmsg("could not send data to client: %m"), + errbacktrace())); } /* diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index ae4a9cb..d660038 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -1229,6 +1229,19 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, if (pq_flush_if_writable() != 0) WalSndShutdown(); + /* + * If this XLogData message causes pg_recvlogical to reach its --endpos, + * pg_recvlogical will exit during this sleep. + */ + pg_usleep(15 * 1000); + /* + * Send a long NoticeResponse. If the peer has exited, internal_flush() + * will observe EPIPE and ereport(COMMERROR). + */ + elog(WARNING, "long notice %065535d", 7); + if (pq_flush_if_writable() != 0) + WalSndShutdown(); + /* Try taking fast path unless we get too close to walsender timeout. */ if (now < TimestampTzPlusMilliseconds(last_reply_timestamp, wal_sender_timeout / 2) &&
Author: Noah Misch <n...@leadboat.com> Commit: Noah Misch <n...@leadboat.com> In successful pg_recvlogical, end PGRES_COPY_OUT cleanly. pg_recvlogical merely called PQfinish(), so the backend sent messages after the disconnect. When that caused EPIPE in internal_flush(), before a LogicalConfirmReceivedLocation(), the next pg_recvlogical would repeat already-acknowledged records. Whether or not the defect causes EPIPE, post-disconnect messages could contain an ErrorResponse that the user should see. One properly ends PGRES_COPY_OUT by repeating PQgetCopyData() until it returns a negative value. Augment one of the tests to cover the case of WAL past --endpos. Back-patch to v10, where commit 7c030783a5bd07cadffc2a1018bc33119a4c7505 first appeared. Before that commit, pg_recvlogical never reached PGRES_COPY_OUT. Reviewed by FIXME. Reported by Thomas Munro. Discussion: https://postgr.es/m/CAEepm=1MzM2Z_xNe4foGwZ1a+MO_2S9oYDq3M5D11=jdu_+...@mail.gmail.com diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c index 0da3277..1304b27 100644 --- a/src/bin/pg_basebackup/pg_recvlogical.c +++ b/src/bin/pg_basebackup/pg_recvlogical.c @@ -580,14 +580,40 @@ StreamLogicalLog(void) res = PQgetResult(conn); if (PQresultStatus(res) == PGRES_COPY_OUT) { + PQclear(res); + /* * We're doing a client-initiated clean exit and have sent CopyDone to - * the server. We've already sent replay confirmation and fsync'd so - * we can just clean up the connection now. + * the server. Drain any messages, so we don't miss a last-minute + * ErrorResponse. The walsender stops generating XLogData records once + * it sees CopyDone, so expect this to finish quickly. After CopyDone, + * it's too late for sendFeedback(), even if this were to take a long + * time. Hence, use synchronous-mode PQgetCopyData(). */ - goto error; + while (1) + { + int r; + + if (copybuf != NULL) + { + PQfreemem(copybuf); + copybuf = NULL; + } + r = PQgetCopyData(conn, ©buf, 0); + if (r == -1) + break; + if (r == -2) + { + pg_log_error("could not read COPY data: %s", + PQerrorMessage(conn)); + time_to_abort = false; /* unclean exit */ + goto error; + } + } + + res = PQgetResult(conn); } - else if (PQresultStatus(res) != PGRES_COMMAND_OK) + if (PQresultStatus(res) != PGRES_COMMAND_OK) { pg_log_error("unexpected termination of replication stream: %s", PQresultErrorMessage(res)); diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl index 0bcd1d0..1334bf6 100644 --- a/src/test/recovery/t/006_logical_decoding.pl +++ b/src/test/recovery/t/006_logical_decoding.pl @@ -71,6 +71,11 @@ my $endpos = $node_master->safe_psql('postgres', ); print "waiting to replay $endpos\n"; +# Insert some rows after $endpos, which we won't read. +$node_master->safe_psql('postgres', + qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(5,50) s;] +); + my $stdout_recv = $node_master->pg_recvlogical_upto( 'postgres', 'test_slot', $endpos, 180, 'include-xids' => '0', @@ -89,7 +94,7 @@ $stdout_recv = $node_master->pg_recvlogical_upto( 'skip-empty-xacts' => '1'); chomp($stdout_recv); is($stdout_recv, '', - 'pg_recvlogical acknowledged changes, nothing pending on slot'); + 'pg_recvlogical acknowledged changes'); $node_master->safe_psql('postgres', 'CREATE DATABASE otherdb');