>>> When sending a big message, WalSndWriteData() notices that it's
>>> approaching timeout and tries to send a keepalive request, but the
>>> request just gets buffered behind the remaining output plugin data and
>>> isn't seen by the client until the client has received the rest of the
>>> pending data.
>>
>> Only for individual messages, not the entire transaction though.
> Right. I initially thought it was the whole tx, but I was mistaken as
> I'd failed to notice that WalSndWriteData() queues a keepalive
> request.

This problem can be resolved by having the client send keepalives
periodically, with an interval shorter than the timeout configured on the
server. For example, if the server is configured with
wal_sender_timeout=60, the client should send a keepalive message to the
server roughly every 60/3 = 20 seconds. In that case the server will not
send a keepalive with the reply-required flag set, and will also not
disconnect the client, because the walsender checks for incoming messages
even while decoding a huge transaction. I faced a similar problem in
pgjdbc and resolved it as described above.

2016-10-31 16:28 GMT+03:00 Craig Ringer <cr...@2ndquadrant.com>:
> On 31 October 2016 at 16:52, Andres Freund <and...@anarazel.de> wrote:
> > Hi,
> >
> > On 2016-10-31 16:34:38 +0800, Craig Ringer wrote:
> >> TL;DR: Logical decoding clients need to generate their own keepalives
> >> and not rely on the server requesting them to prevent timeouts. Or
> >> admins should raise the wal_sender_timeout by a LOT when using logical
> >> decoding on DBs with any big rows.
> >
> > Unconvinced.
>
> Yeah. I've seen enough issues in the wild where we keep timing out and
> restarting over and over until we increase wal_sender_timeout to know
> there's _something_ going on. I am less sure I'm right about what is
> or how to solve it.
>
> >> When sending a big message, WalSndWriteData() notices that it's
> >> approaching timeout and tries to send a keepalive request, but the
> >> request just gets buffered behind the remaining output plugin data and
> >> isn't seen by the client until the client has received the rest of the
> >> pending data.
> >
> > Only for individual messages, not the entire transaction though.
>
> Right. I initially thought it was the whole tx, but I was mistaken as
> I'd failed to notice that WalSndWriteData() queues a keepalive
> request.
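The client-side scheme described above (send an unsolicited keepalive every
wal_sender_timeout / 3) can be sketched as below. This is an illustrative
sketch, not pgjdbc's actual code; all function names are invented, and the
division by 3 simply follows the suggestion in this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: the client tracks when it last sent a Standby
 * Status Update and proactively sends a new one once a third of
 * wal_sender_timeout has elapsed, so the server never needs to request
 * a reply and never reaches its disconnect threshold. */

/* Interval between client-initiated keepalives, in milliseconds. */
static int64_t
keepalive_interval_ms(int64_t wal_sender_timeout_ms)
{
    /* One third of the server timeout, per the suggestion above. */
    return wal_sender_timeout_ms / 3;
}

/* Should the client send an unsolicited keepalive now? */
static bool
keepalive_due(int64_t now_ms, int64_t last_status_ms,
              int64_t wal_sender_timeout_ms)
{
    return now_ms - last_status_ms >=
        keepalive_interval_ms(wal_sender_timeout_ms);
}
```

With wal_sender_timeout=60s the client would send a status update every
20 seconds, comfortably inside the server's disconnect window.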
> > Are
> > you sure the problem at hand is that we're sending a keepalive, but it's
> > too late?
>
> No, I'm not sure. I'm trying to identify the cause of an issue I've
> seen in the wild, but never under conditions where it's been possible
> to sit around and debug in a leisurely manner.
>
> I'm trying to set up a TAP test to demonstrate that this happens, but
> I don't think it's going to work without some kind of network
> bandwidth limitation simulation or simulated latency. A local unix
> socket is just too fast for Pg's row size limits.
>
> > It might very well be that the actual issue is that we're
> > never sending keepalives, because the network is fast enough / the tcp
> > window is large enough. IIRC we only send a keepalive if we're blocked
> > on network IO?
>
> Mm, that's a good point. That might better explain the issues I've
> seen in the wild, since I never found strong evidence that individual
> big rows were involved, but hadn't been able to come up with anything
> else yet.
>
> >> So: We could ask output plugins to deal with this for us, by chunking
> >> up their data in small pieces and calling OutputPluginPrepareWrite()
> >> and OutputPluginWrite() more than once per output plugin callback if
> >> they expect to send a big message. But this pushes the complexity of
> >> splitting up and handling big rows, and big Datums, onto each plugin.
> >> It's awkward to do well and hard to avoid splitting things up
> >> unnecessarily.
> >
> > There's decent reason for doing that independently though, namely that
> > it's a lot more efficient from a memory management POV.
>
> Definitely. Though you're always going to be tossing around ridiculous
> chunks of memory when dealing with big external compressed toasted
> data, unless there are ways to access that progressively that I'm
> unaware of. Hopefully there are.
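The chunking approach quoted above (multiple OutputPluginPrepareWrite() /
OutputPluginWrite() calls per output plugin callback) reduces to a loop
like the following. A minimal sketch, not PostgreSQL source: the write_cb
callback stands in for the prepare/write pair, and all names are invented.

```c
#include <stddef.h>

/* Hypothetical stand-in for an OutputPluginPrepareWrite() +
 * OutputPluginWrite() pair emitting one protocol message. */
typedef void (*chunk_write_fn)(const char *data, size_t len, void *state);

/* Split one large payload into fixed-size pieces, invoking the write
 * callback once per piece; returns the number of messages emitted.
 * Between messages the walsender gets a chance to process keepalives. */
static size_t
write_in_chunks(const char *data, size_t len, size_t chunk_size,
                chunk_write_fn write_cb, void *state)
{
    size_t nchunks = 0;
    size_t off;

    for (off = 0; off < len; off += chunk_size)
    {
        size_t n = len - off < chunk_size ? len - off : chunk_size;

        if (write_cb)
            write_cb(data + off, n, state);
        nchunks++;
    }
    return nchunks;
}
```

As the mail notes, the hard part is not this loop but deciding where a
plugin may split a row or Datum without the receiver needing extra
reassembly logic.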
> I'd quite like to extend the bdr/pglogical/logicalrep protocol so that
> in-core logical rep, in some later version, can write a field as 'to
> follow', like we currently mark unchanged toasted datums separately.
> Then send it chunked, after the main row, in follow-up messages. That
> way we keep processing keepalives, we don't allocate preposterous
> amounts of memory, etc.
>
> > I don't think the "unrequested keepalive" approach really solves the
> > problem on a fundamental enough level.
>
> Fair. It feels a bit like flailing in the dark, too.
>
> >> (A separate issue is that we can also time out when doing logical
> >> _replication_ if the downstream side blocks on a lock, since it's not
> >> safe to send on a socket from a signal handler ... )
>
> > That's strictly speaking not true. write() / sendmsg() are signal safe
> > functions. There's good reasons not to do that however, namely that the
> > non signal handler code might be busy writing data itself.
>
> Huh, ok. And since in pglogical/bdr and as far as I can tell in core
> logical rep we don't send anything on the socket while we're calling
> in to heap access, the executor, etc, that'd actually be an option. We
> could possibly safeguard it with a volatile "socket busy" flag since
> we don't do much sending anyway. But I'd need to do my reading on
> signal handler safety etc. Still, good to know it's not completely
> absurd to do this if the issue comes up.
>
> Thanks very much for the input. I saw your post with proposed changes.
> Once I can get the issue simulated reliably I'll test the patch and
> see if it solves it, but it looks like it's sensible to just apply it
> anyway TBH.
>
> --
> Craig Ringer    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
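The 'to follow' protocol idea quoted above could look roughly like this on
the wire. Entirely illustrative: the field markers and the chunk math are
invented for this sketch and are not part of any PostgreSQL or pglogical
protocol.

```c
#include <stdint.h>

/* Hypothetical field markers for a row message: an oversized field
 * carries a sentinel instead of inline data, and its bytes arrive
 * afterwards in separate, individually small chunk messages, so
 * keepalive processing keeps running between sends. */
#define FIELD_INLINE    0x00    /* field data follows immediately */
#define FIELD_TO_FOLLOW 0x01    /* field data sent later in chunks */

/* How many follow-up chunk messages a field of field_len bytes needs
 * (ceiling division, zero for an empty field). */
static uint32_t
follow_chunks_needed(uint32_t field_len, uint32_t chunk_size)
{
    return (field_len + chunk_size - 1) / chunk_size;
}
```

A 1 GB toasted datum sent in, say, 64 kB chunks would then never force a
single multi-second blocking write, and would not have to be held in
memory as one allocation on either side.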
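The volatile "socket busy" flag Craig mentions could be sketched as
follows. Again a hypothetical illustration, not PostgreSQL code: the idea
is that the normal send path brackets its writes with the flag, and a
signal handler only writes a keepalive to the socket when the flag is
clear, so the two paths never interleave data on the wire.

```c
#include <signal.h>
#include <stdbool.h>

/* Set while the normal (non-signal) code path is mid-write on the
 * socket.  sig_atomic_t so a handler can read it safely. */
static volatile sig_atomic_t socket_busy = 0;

/* Called by the normal send path around each socket write. */
static void begin_socket_write(void) { socket_busy = 1; }
static void end_socket_write(void)   { socket_busy = 0; }

/* Called from a signal handler: is it safe to write a keepalive now?
 * write() itself is async-signal-safe; the flag only guards against
 * interleaving with an in-progress write by the main code path. */
static bool
handler_may_send_keepalive(void)
{
    return socket_busy == 0;
}
```

Note this sketch only prevents interleaving in one direction; a flag
checked-then-acted-on in a handler still races with the main path setting
it, which is presumably part of the "reading on signal handler safety"
Craig says he'd need to do.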