On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> So I did that - same configure options as the buildfarm client, and a
> 'make check' (with only tests up to the 'join' suite, because that's
> where it got stuck before). And it took only ~15 runs (~1h) to hit this
> again on dikkop.

That's good news.

> I managed to collect the fstat/procstat stuff Thomas asked for, and the
> backtraces - attached. I still have the core files, in case we look at
> something. As before, running gcore on the second worker (29081) gets
> this unstuck - it sends some signal that apparently wakes it up.

Thanks! As expected, no bytes in the pipe for any of those processes.
Unfortunately I gave the wrong procstat command: it should be -i, not
-j. Does "procstat -i /path/to/core | grep USR1" show P (pending) for
that stuck process? Silly question really, I don't really expect
poll() to be misbehaving in such a basic way.

I was talking to Andres on IM about this yesterday and he pointed out
a potential out-of-order hazard: WaitEventSetWait() sets "waiting" (to
tell the signal handler to write to the self-pipe) and then reads
latch->is_set with neither compiler nor memory barrier. That doesn't
seem right, because we might see a value of latch->is_set from before
"waiting" was true, and yet the signal handler might also have run
while "waiting" was false, so the self-pipe doesn't save us, despite
the length of the comment about that. Can you reproduce it with this
change?

--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1011,6 +1011,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
         * ordering, so that we cannot miss seeing is_set if a notification
         * has already been queued.
         */
+       pg_memory_barrier();
        if (set->latch && set->latch->is_set)
        {
                occurred_events->fd = PGINVALID_SOCKET;
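
To make the suspected hazard concrete, here is a minimal standalone sketch of the store-then-load pattern on both sides. It is a simplified model, not the latch.c code: the function names, the wakeups_sent counter (standing in for a byte written to the self-pipe) and the use of C11 atomic_thread_fence() in place of pg_memory_barrier() are all assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the state discussed above. */
static atomic_bool waiting;      /* waiter -> setter: "wake me if you set the latch" */
static atomic_bool is_set;       /* setter -> waiter: "the latch has been set" */
static atomic_int  wakeups_sent; /* models bytes written to the self-pipe */

/*
 * Waiter side: publish "waiting", then check the latch.
 * Returns true if the latch was already seen set (no need to sleep).
 */
static bool
waiter_prepare_to_sleep(void)
{
    atomic_store_explicit(&waiting, true, memory_order_relaxed);
    /* Full barrier: the store above must not be reordered after the load below. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&is_set, memory_order_relaxed);
}

/*
 * Setter side: set the latch, then check whether the waiter needs a wakeup.
 * Returns true if a wakeup was sent.
 */
static bool
setter_set_latch(void)
{
    atomic_store_explicit(&is_set, true, memory_order_relaxed);
    /* Matching full barrier on the setter side. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&waiting, memory_order_relaxed))
    {
        atomic_fetch_add(&wakeups_sent, 1);
        return true;
    }
    return false;
}

/* Reset the model between experiments. */
static void
reset_state(void)
{
    atomic_store(&waiting, false);
    atomic_store(&is_set, false);
    atomic_store(&wakeups_sent, 0);
}
```

If the two functions run concurrently with both fences in place, at least one of them should return true: either the waiter sees is_set, or the setter sees "waiting" and sends the wakeup. Without the fence on the waiter side, the load of is_set may be satisfied before the store to "waiting" becomes visible (store-load reordering is allowed even on x86), so both can return false and the waiter sleeps with nothing in the pipe.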