Hi all I've been investigating some intermittent timeout issues I've seen with BDR users in the wild and I think I've identified an issue with how logical decoding sends data and handles feedback.
TL;DR: Logical decoding clients need to generate their own keepalives and not rely on the server requesting them to prevent timeouts. Or admins should raise the wal_sender_timeout by a LOT when using logical decoding on DBs with any big rows. An output plugin accumulates data in a StringInfo which is then sent by WalSndWriteData(), which does nonblocking sends, checks for interrupts, processes client replies, and requests keepalives if approaching walsender timeout. A problem arises when the client's buffered data is big enough that it takes more than wal_sender_timeout to transmit the whole chunk and get a reply from the client on the link. A slow link, a big row, or some badly timed packet loss can easily result in timeout. Unlike in physical replication, a timeout on a logical decoding session can be expensive: * We have to restart logical decoding at restart_lsn, which can be a LONG time ago if there are any long running tx's, and decode up to the last confirmed xact; * We then have to resend the whole xact we were working on when we timed out, which might be quite big. so we should do our best to avoid timeouts in logical decoding sessions. Physical replication chunks data into small pieces and can restart cheaply, but logical decoding will send a 100MB+ row as a single message and have to re-read hundreds of GB of WAL and resend gigabytes of data on the wire to get back where it started after a timeout. When sending a big message, WalSndWriteData() notices that it's approaching timeout and tries to send a keepalive request, but the request just gets buffered behind the remaining output plugin data and isn't seen by the client until the client has received the rest of the pending data. The keepalive requests are useful so long as no one message takes more than wal_sender_timeout to send since it'll defer timeout and give us enough time to process the next message. But they won't prevent timeout within a single big message. How to fix this? We can't inject keepalive requests in the middle of a CopyData message. After the cancel-message discussion I know better than to suggest using out-of-band tcp data and SIGURG. So: We could ask output plugins to deal with this for us, by chunking up their data in small pieces and calling OutputPluginPrepareWrite() and OutputPluginWrite() more than once per output plugin callback if they expect to send a big message. But this pushes the complexity of splitting up and handling big rows, and big Datums, onto each plugin. It's awkward to do well and hard to avoid splitting things up unnecessarily. We could extend the logical decoding protocol slightly, so that the CopyData messages that carry output plugin payloads are limited to some fixed, reasonable size (1MB? 10MB?) where the CopyData message overhead is insignificant and most rows won't need to be reassembled from multiple CopyData messages by the downstream. This gives us gaps in which we can inject keepalives / status updates during big messages so we get the client replies and process them. This won't fix existing releases though, and makes clients reassemble the chunks. It also means clients can't know the full size of a message based on the CopyData length anymore so it'll force clients that rely on the CopyData message length to add their own length fields in the payload. CopyData doesn't have anything we can use as a "continuation message follows" flag. We could permit the walsender to reset the keepalive timer whenever it confirms that it's sent a reasonable chunk of data such that it's unlikely to just be lurking in the kernel's socket send buffer or various routers in transit. We don't care that the client received up to any particular point, only that it's alive and making progress. We know the client must be receiving and acking it since we're using TCP. The problem is knowing what "enough' data sent to mean the client must've received some is. The socket send buffer is easy to check and bounded, but TCP window scaling means there can be a huge amount of un-acked data in transit. AFAIK we have no visibility into the network stack's view of what's acked, at least portably. Linux has TCP_INFO but it's hard to use and nonportable. So we can't assume that even successfully sending 1GB of data means the client got data 1GB-ago unless we clamp the congestion window or do other possibly-performance-affecting things. Or we could expect logical decoding downstream clients to send their own proactive feedback when receiving and processing large messages. The walsender is already capable of noticing and processing such replies, just not requesting them when it should. Again this pushes complexity onto each client, but less so than requiring output plugins to chunk up their messages. Also, since each client-initiated feedback message resets the walsender's waiting_for_ping_response flag, it'll queue multiple keepalives after big messages and we'll process them in a burst. Annoying but minor. I think the least bad is to have clients generate their own feedback messages, unprompted by the upstream, when enough time has elapsed. Unfortunately wal_sender_timeout isn't sent in the initial params by the server and the walsender protocol doesn't seem to have any way to ask the server for parameter values so you can't check the upstream's wal_sender_timeout and have to just pick a sensible default. (A separate issue is that we can also time out when doing logical _replication_ if the downstream side blocks on a lock, since it's not safe to send on a socket from a signal handler and you can't interrupt the lock wait to do some work then resume the lock wait. But that's for later...) Search kw: * Logical decoding * terminating walsender process due to replication timeout * wal_sender_timeout -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers