On 2014-09-29 14:46:20 -0400, Robert Haas wrote:
> On Fri, May 9, 2014 at 10:18 AM, Robert Haas <robertmh...@gmail.com> wrote:
> > On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave.p...@enterprisedb.com>
> > wrote:
> >> Hamid@EDB; Can you please have someone configure anole to build git
> >> head as well as the other branches? Thanks.
> >
> > The test_shm_mq regression tests hung on this machine this morning.
> > Hamid was able to give me access to log in and troubleshoot.
> > Unfortunately, I wasn't able to completely track down the problem
> > before accidentally killing off the running cluster, but it looks like
> > test_shm_mq_pipelined() tried to start 3 background workers and the
> > postmaster only ever launched one of them, so the test just sat there
> > and waited for the other two workers to start. At this point, I have
> > no idea what could cause the postmaster to be asleep at the switch
> > like this, but it seems clear that's what happened.
>
> This happened again, and I investigated further. It looks like the
> postmaster knows full well that it's supposed to start more bgworkers:
> the ones that never get started are in the postmaster's
> BackgroundWorkerList, and StartWorkerNeeded is true. But it only
> starts the first one, not all three. Why?
>
> Here's my theory. When I did a backtrace inside the postmaster, it
> was stuck inside select(), within ServerLoop(). I think that's
> just where it was when the backend that wanted to run test_shm_mq
> requested that a few background workers get launched. Each
> registration would have sent the postmaster a separate SIGUSR1, but
> for some reason the postmaster only received one, which I think is
> legit behavior, though possibly not typical on modern Linux systems.
> When the SIGUSR1 arrived, the postmaster jumped into
> sigusr1_handler(). sigusr1_handler() calls maybe_start_bgworker(),
> which launched the first background worker. Then it returned, and the
> arrival of the signal did NOT interrupt the pending select().
>
> This chain of events can't occur if an arriving SIGUSR1 causes
> select() to return EINTR or EWOULDBLOCK, nor can it happen if the
> signal handler is entered three separate times, once for each SIGUSR1.
> That combination of explanations seems likely sufficient to explain
> why this doesn't occur on other machines.
>
> The code seems to have been this way since the commit that introduced
> background workers (da07a1e856511dca59cbb1357616e26baa64428e),
> although the function was called StartOneBackgroundWorker back then.

If that theory is true, wouldn't things get unstuck every time a new
connection comes in? Or once 60 seconds have passed? That's not to say
this isn't wrong, but still?

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
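
The coalescing step in the quoted theory (three SIGUSR1s sent, one
handler invocation, one worker started) can be reproduced outside
PostgreSQL. What follows is only a rough sketch in Python, not the
postmaster's C code: simulate_burst() and the "worker-N" names are
made up for illustration. It assumes a Unix system and that it runs in
the main thread. The signal is blocked while the burst is sent, which
stands in for the process not getting a chance to run its handler
between deliveries.

```python
import signal
import threading


def simulate_burst(requests=3):
    """Send `requests` SIGUSR1s while the signal is blocked, then unblock.

    Returns (started, still_pending). The handler starts one worker per
    invocation, mirroring the behaviour described in the quoted mail.
    All names here are hypothetical; this is not PostgreSQL code.
    """
    pending = ["worker-%d" % i for i in range(requests)]
    started = []

    def handler(signum, frame):
        # One invocation launches at most one pending worker.
        if pending:
            started.append(pending.pop(0))

    old_handler = signal.signal(signal.SIGUSR1, handler)
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
    try:
        # Standard (non-realtime) signals are not queued: while one
        # SIGUSR1 is already pending, further ones are merged into it.
        for _ in range(requests):
            signal.pthread_kill(threading.get_ident(), signal.SIGUSR1)
    finally:
        # Unblocking delivers the single pending signal; the Python
        # handler has run by the time this call returns.
        signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGUSR1})
        signal.signal(signal.SIGUSR1, old_handler)
    return started, pending


if __name__ == "__main__":
    started, pending = simulate_burst(3)
    print("started:", started)
    print("still pending:", pending)
```

With three requests the handler is entered once, so one worker is
started and two stay pending until something else prompts another
look at the list, which is where the new-connection and 60-second
questions above come in.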