On Sat, Sep 9, 2023 at 7:00 AM Alexander Lakhin <exclus...@gmail.com> wrote:
> It takes less than 10 minutes on average for me. I checked
> REL_12_STABLE, REL_13_STABLE, and REL_14_STABLE (with HAVE_KQUEUE undefined
> forcefully) - they all are affected.
> I could not reproduce the lockup on my Ubuntu box (with HAVE_SYS_EPOLL_H
> undefined manually). And surprisingly for me, I could not reproduce it on
> master and REL_16_STABLE.
> `git bisect` for this behavior change pointed at 7389aad63 (though maybe it
> just greatly decreased probability of the failure; I'm going to double-check
> this).
> In particular, that commit changed this:
> -    /*
> -     * Ignore SIGURG for now. Child processes may change this (see
> -     * InitializeLatchSupport), but they will not receive any such signals
> -     * until they wait on a latch.
> -     */
> -    pqsignal_pm(SIGURG, SIG_IGN); /* ignored */
> -#endif
> +    /* This may configure SIGURG, depending on platform. */
> +    InitializeLatchSupport();
> +    InitProcessLocalLatch();
>
> With debugging logging added I see (on 7389aad63~1) that one process
> really sends SIGURG to another, and the latter reaches poll(), but it
> just got no signal, its signal handler not called and poll() just waits...
Thanks for working so hard on this, Alexander. That is a surprising
discovery! So changes to the signal handler arrangements in the
*postmaster* before the child was forked affected this?

> So it looks like the ARM weak memory model is not the root cause of the
> issue. But as far as I can see, it's still specific to FreeBSD (but not
> specific to a compiler - I used gcc and clang with the same success).

Idea: FreeBSD 13 introduced a new mechanism called sigfastblock[1],
which lets system libraries control signal blocking with atomic memory
tricks in a word of user space memory. I have no particular theory for
why it would be going wrong here (I don't expect us to be using any of
the stuff that would use it, though I don't understand it in detail so
that doesn't say much), but it occurred to me that all reports so far
have been on 13.x or 14. I wonder... If you have a good fast recipe
for reproducing this, could you also try it on FreeBSD 12.4?

[1] https://man.freebsd.org/cgi/man.cgi?query=sigfastblock&sektion=2&manpath=FreeBSD+13.0-current