Hi, On 2022-01-15 13:19:42 +1300, Thomas Munro wrote: > On Sat, Jan 15, 2022 at 11:44 AM Andres Freund <and...@anarazel.de> wrote: > > > The patch Alexander tested most recently uses a tri-state eof flag [...] > > > > What about instead giving WalReceiverConn an internal WaitEventSet, and > > using > > that consistently? I've attached a draft for that. > > > > Alexander, could you test with that patch applied? > > Isn't your patch nearly identical to one that I already posted, that > Alexander tested and reported success with here?
Sorry, somehow I missed that across the many patches in the thread... And yes, it does look remarkably similar. > https://www.postgresql.org/message-id/5d507424-13ce-d19f-2f5d-ab4c6a987316%40gmail.com > > I can believe that fixes walreceiver One thing that still bothers me around this is that we didn't detect the problem of the dead walreceiver connection, even after missing the FD_CLOSE. There's plenty other ways that a connection can get stalled for prolonged was, that we'd not get notified about either. That's why there's wal_receiver_timeout, after all. But from what I can see wal_receiver_timeout doesn't work even halfway reliably, because of all the calls down to libpqrcv_PQgetResult, where we just block indefinitely? This actually seems like a significant issue to me, and not just on windows. > (if we're sure that there isn't a libpq-changes-the-socket problem) I just don't see what that problem could be, once connection is established. The only way a the socket fd could change is a reconnect, which doesn't happen automatically. I actually was searching the archives for threads on it, but I didn't find much besides the references around [1]. And I didn't see a concrete risk explained there? > but AFAICS the same problem exists for postgres_fdw and async append. Perhaps - but I suspect it'll matter far less with them than with walreceiver. > That's why I moved to trying to > fix the multiple-WES thing (though of course I agree we should be > using long lived WESes wherever possible That approach seems like a very very leaky bandaid, with a decent potential for unintended consequences. Perhaps there's nothing better that we can do, but I'd rather try to fix the problem closer to the root... > I just didn't think that seemed back-patchable, so it's more of a feature > patch for the future). Hm, it doesn't seem crazy invasive to me. But I guess we might be looking at a revert of the shutdown changes for now anyway? In that case we should be fixing this anyway, but we might be able to afford doing it in master only? Greetings, Andres Freund [1] https://www.postgresql.org/message-id/CA%2BhUKG%2BzCNJZBXcURPdQvdY-tjyD0y7Li2wZEC6XChyUej1S5w%40mail.gmail.com