Hi, On 2022-01-11 22:59:13 +1300, Thomas Munro wrote: > I considered another idea we discussed: if we see a latch event, clear > it and try again so that other events can be revealed (rince and > repeat), but remember if that happens and set the latch at the end. I > think that still requires PG_FINALLY() if you want to guarantee not to > eat a latch event if WaitEventSetWait() throws. This may be a > theoretical point because things must be pretty broken if > WaitEventSetWait() is throwing, but I don't like an egregious lack of > exception safety on principle.
I don't think this is a problem. Not because of WaitEventSetWait() never throwing, but because it's "just fine" to have reset the latch in that case. The error will cause control flow to transfer to the next PG_CATCH site. The point of latches is to avoid racy checks for events (or sleeps with short timeouts to handle the races). The documented pattern is: * for (;;) * { * ResetLatch(); * if (work to do) * Do Stuff(); * WaitLatch(); * } Latches only work if there's a very limited amount of things happening between the if (work_to_do) and the WaitLatch(). It definitely is *NOT* be ok to do something like: for (;;) { ResetLatch() if (work_to_do) DoStuff(); PG_TRY(): something_that_may_throw(); PG_CATCH(): something_not_throwing(); WaitLatch(); } For one, elog.c related code might have actually done network IO! During which the latch very well might be reset. So there's just no way any remotely reasonable code can rely on preserving latch state across errors. > I considered another idea we discussed: if we see a latch event, clear > it and try again so that other events can be revealed (rince and > repeat), but remember if that happens and set the latch at the end. The more I think about it, the less I see why we *ever* need to re-arm the latch in pq_check_connection() in this approach. pq_check_connection() is only used from from ProcessInterrupts(), and there's plenty things inside ProcessInterrupts() that can cause latches to be reset (e.g. parallel message processing causing log messages to be sent to the client, causing network IO, which obviously can do a latch reset). It makes sense to use CFI() in a latch loop, but it would have to be at the bottom or top, not between if (work_to_do) and WaitLatch(). Greetings, Andres Freund