On 2014-05-16 17:02:33 -0400, Steve Singer wrote: > >I don't think that's going to cut it though. The creation can take > >longer than whatever wal_sender_timeout is set to (when there's lots of > >longrunning transactions). I think checking whether last_reply_timestamp > >= 0 during timeout checking is more robust.
> That makes sense. > A patch that does that is attached. I've attached a heavily revised version of that patch. Didn't touch all the necessary places... Survives creation of logical replication slots under 'intensive pressure', with a wal_sender_timeout=10ms. Thanks for the bugreport! Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
>From 324d74e16dd8ee2a0fa977d92a269fd43a746196 Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Sat, 17 May 2014 01:22:01 +0200 Subject: [PATCH] Don't pay heed to wal_sender_timeout while creating a decoding slot. Sometimes CREATE_REPLICATION_SLOT ... LOGICAL ... needs to wait for futher WAL using WalSndWaitForWal(). That used to always respect wal_sender_timeout and kill the session when waiting long enough because no feedback/ping messages can be sent while the slot is still being created. Add the notion that last_reply_timestamp = 0 means that the walsender currently doesn't need timeout processing and add checks for it in a couple of places. Bugreport and initial patch by Steve Singer, revised by Andres Freund. --- src/backend/replication/walsender.c | 25 +++++++++++++++++++++---- 1 file changed, 21 insertions(+), 4 deletions(-) diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 5c11d68..90394ce 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -148,9 +148,10 @@ static StringInfoData reply_message; static StringInfoData tmpbuf; /* - * Timestamp of the last receipt of the reply from the standby. + * Timestamp of the last receipt of the reply from the standby. Set to 0 if a + * the process currently shouldn't be killed by wal_sender_timeout. */ -static TimestampTz last_reply_timestamp; +static TimestampTz last_reply_timestamp = 0; /* Have we sent a heartbeat message asking for reply, since last reply? */ static bool waiting_for_ping_response = false; @@ -796,6 +797,14 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd) logical_read_xlog_page, WalSndPrepareWrite, WalSndWriteData); + /* + * Signal that we don't need the timeout mechanism. We're just + * creating the replication slot and don't yet accept feedback + * messages or send keepalives. As we possibly need to wait for + * further WAL the walsender would possibly be killed too soon. + */ + last_reply_timestamp = 0; + /* build initial snapshot, might take a while */ DecodingContextFindStartpoint(ctx); @@ -1693,7 +1702,7 @@ WalSndComputeSleeptime(TimestampTz now) { long sleeptime = 10000; /* 10 s */ - if (wal_sender_timeout > 0) + if (wal_sender_timeout > 0 && last_reply_timestamp > 0) { TimestampTz wakeup_time; long sec_to_timeout; @@ -1735,6 +1744,10 @@ WalSndCheckTimeOut(TimestampTz now) { TimestampTz timeout; + /* don't bail out if we're doing something that doesn't require timeouts */ + if (last_reply_timestamp <= 0) + return; + timeout = TimestampTzPlusMilliseconds(last_reply_timestamp, wal_sender_timeout); @@ -2879,7 +2892,11 @@ WalSndKeepaliveIfNecessary(TimestampTz now) { TimestampTz ping_time; - if (wal_sender_timeout <= 0) + /* + * Don't send keepalive messages if timeouts are globally disabled or + * we're doing something not partaking in timeouts. + */ + if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0) return; if (waiting_for_ping_response) -- 2.0.0.rc2.4.g1dc51c6.dirty
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers