While trying to understand a recent buildfarm failure [1], I got annoyed
by the fact that a walsender exposes its last SQL command in
pg_stat_activity even when that was a long time ago and what it's been
doing since then is replication commands. This leads to *totally*
misleading reports, for instance in [1] we read
2020-09-13 19:10:09.632 CEST [5f5e526d.242415:4] LOG: server process (PID
2368853) was terminated by signal 11: Segmentation fault
2020-09-13 19:10:09.632 CEST [5f5e526d.242415:5] DETAIL: Failed process was
running: SELECT pg_catalog.set_config('search_path', '', false);
even though the process's own log messages paint a much better picture of
reality:
2020-09-13 19:10:07.302 CEST [5f5e526f.242555:1] LOG: connection received:
host=[local]
2020-09-13 19:10:07.303 CEST [5f5e526f.242555:2] LOG: replication connection
authorized: user=bf application_name=sub2
2020-09-13 19:10:07.303 CEST [5f5e526f.242555:3] LOG: statement: SELECT
pg_catalog.set_config('search_path', '', false);
2020-09-13 19:10:07.334 CEST [5f5e526f.242555:4] LOG: received replication
command: IDENTIFY_SYSTEM
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:5] LOG: received replication
command: START_REPLICATION SLOT "sub2" LOGICAL 0/0 (proto_version '2',
publication_names '"pub2"')
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:6] LOG: starting logical
decoding for slot "sub2"
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:7] DETAIL: Streaming
transactions committing after 0/159EB38, reading WAL from 0/159EB00.
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:8] LOG: logical decoding found
consistent point at 0/159EB00
2020-09-13 19:10:07.335 CEST [5f5e526f.242555:9] DETAIL: There are no running
transactions.
I think that walsenders ought to report their current replication command
as though it were a SQL command. They oughta set debug_query_string, too.
While trying to fix this, I also observed that exec_replication_command
fails to clean up the temp context it made for parsing the command string,
if that turns out to be a SQL command. This very accidentally fails to
lead to a long-term memory leak, because that context will be a child of
MessageContext so it'll be cleaned out in the next iteration of
PostgresMain's main loop. But it's still unacceptably sloppy coding
IMHO, and it's closely related to a lot of other randomness in the
function; it'd be better to have a separate early-exit path for SQL
commands.
Attached find a proposed fix for these things.
What I'd really like to do is adjust pgstat_report_activity so that
callers are *required* to provide some kind of command string when
transitioning into STATE_RUNNING state, ie something like
Assert(state == STATE_RUNNING ? cmd_str != NULL : cmd_str == NULL);
However, before we could do that, we'd have to clean up the rat's nest
of seemingly-inserted-with-the-aid-of-a-dartboard pgstat_report_activity
calls in replication/logical/worker.c. I couldn't make heads or tails
of what is going on there, so I did not try.
BTW, while I didn't change it here, isn't the second
SnapBuildClearExportedSnapshot call in exec_replication_command just
useless? We already did that before parsing the command.
regards, tom lane
[1]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3f756b470a..9a2b154797 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1545,6 +1545,9 @@ exec_replication_command(const char *cmd_string)
CHECK_FOR_INTERRUPTS();
+ /*
+ * Parse the command.
+ */
cmd_context = AllocSetContextCreate(CurrentMemoryContext,
"Replication command context",
ALLOCSET_DEFAULT_SIZES);
@@ -1560,15 +1563,38 @@ exec_replication_command(const char *cmd_string)
cmd_node = replication_parse_result;
+ /*
+ * If it's a SQL command, just clean up our mess and return false; the
+ * caller will take care of executing it.
+ */
+ if (IsA(cmd_node, SQLCmd))
+ {
+ if (MyDatabaseId == InvalidOid)
+ ereport(ERROR,
+ (errmsg("cannot execute SQL commands in WAL sender for physical replication")));
+
+ MemoryContextSwitchTo(old_context);
+ MemoryContextDelete(cmd_context);
+
+ /* Tell the caller that this wasn't a WalSender command. */
+ return false;
+ }
+
+ /*
+ * Report query to various monitoring facilities. For this purpose, we
+ * report replication commands just like SQL commands.
+ */
+ debug_query_string = cmd_string;
+
+ pgstat_report_activity(STATE_RUNNING, cmd_string);
+
/*
* Log replication command if log_replication_commands is enabled. Even
* when it's disabled, log the command with DEBUG1 level for backward
- * compatibility. Note that SQL commands are not logged here, and will be
- * logged later if log_statement is enabled.
+ * compatibility.
*/
- if (cmd_node->type != T_SQLCmd)
- ereport(log_replication_commands ? LOG : DEBUG1,
- (errmsg("received replication command: %s", cmd_string)));
+ ereport(log_replication_commands ? LOG : DEBUG1,
+ (errmsg("received replication command: %s", cmd_string)));
/*
* CREATE_REPLICATION_SLOT ... LOGICAL exports a snapshot. If it was
@@ -1578,10 +1604,9 @@ exec_replication_command(const char *cmd_string)
SnapBuildClearExportedSnapshot();
/*
- * For aborted transactions, don't allow anything except pure SQL, the
- * exec_simple_query() will handle it correctly.
+ * Disallow replication commands in aborted transaction blocks.
*/
- if (IsAbortedTransactionBlockState() && !IsA(cmd_node, SQLCmd))
+ if (IsAbortedTransactionBlockState())
ereport(ERROR,
(errcode(ERRCODE_IN_FAILED_SQL_TRANSACTION),
errmsg("current transaction is aborted, "
@@ -1597,9 +1622,6 @@ exec_replication_command(const char *cmd_string)
initStringInfo(&reply_message);
initStringInfo(&tmpbuf);
- /* Report to pgstat that this process is running */
- pgstat_report_activity(STATE_RUNNING, NULL);
-
switch (cmd_node->type)
{
case T_IdentifySystemCmd:
@@ -1651,17 +1673,6 @@ exec_replication_command(const char *cmd_string)
}
break;
- case T_SQLCmd:
- if (MyDatabaseId == InvalidOid)
- ereport(ERROR,
- (errmsg("cannot execute SQL commands in WAL sender for physical replication")));
-
- /* Report to pgstat that this process is now idle */
- pgstat_report_activity(STATE_IDLE, NULL);
-
- /* Tell the caller that this wasn't a WalSender command. */
- return false;
-
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
@@ -1677,6 +1688,7 @@ exec_replication_command(const char *cmd_string)
/* Report to pgstat that this process is now idle */
pgstat_report_activity(STATE_IDLE, NULL);
+ debug_query_string = NULL;
return true;
}