On Mon, Sep 29, 2014 at 2:52 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>> This happened again, and I investigated further. It looks like the
>> postmaster knows full well that it's supposed to start more bgworkers:
>> the ones that never get started are in the postmaster's
>> BackgroundWorkerList, and StartWorkerNeeded is true. But it only
>> starts the first one, not all three. Why?
>> Here's my theory. When I did a backtrace inside the postmaster, it
>> was stuck inside select(), within ServerLoop(). I think that's
>> just where it was when the backend that wanted to run test_shm_mq
>> requested that a few background workers get launched. Each
>> registration would have sent the postmaster a separate SIGUSR1, but
>> for some reason the postmaster only received one, which I think is
>> legit behavior, though possibly not typical on modern Linux systems.
>> When the SIGUSR1 arrived, the postmaster jumped into
>> sigusr1_handler(). sigusr1_handler() calls maybe_start_bgworker(),
>> which launched the first background worker. Then it returned, and the
>> arrival of the signal did NOT interrupt the pending select().
>> This chain of events can't occur if an arriving SIGUSR1 causes
>> select() to return EINTR or EWOULDBLOCK, nor can it happen if the
>> signal handler is entered three separate times, once for each SIGUSR1.
>> That combination of explanations seems likely sufficient to explain
>> why this doesn't occur on other machines.
>> The code seems to have been this way since the commit that introduced
>> background workers (da07a1e856511dca59cbb1357616e26baa64428e),
>> although the function was called StartOneBackgroundWorker back then.
> If that theory is true, wouldn't things get unstuck every time a new
> connection comes in? Or 60 seconds have passed? That's not to say this
> isn't wrong, but still?
There aren't going to be any new connections arriving when running
the contrib regression tests, I believe, so I don't think there is an
escape hatch there. I didn't think to check how timeout was set in
ServerLoop, and it does look like the maximum ought to be 60 seconds,
so either there's some other ingredient I'm missing here, or the whole
theory is just wrong altogether. :-(
Sent via pgsql-hackers mailing list (email@example.com)