Re: [HACKERS] test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)

Andres Freund Fri, 03 Oct 2014 14:48:48 -0700

On 2014-10-03 14:38:10 -0400, Robert Haas wrote:
> On Fri, Oct 3, 2014 at 1:09 PM, Robert Haas <[email protected]> wrote:
> > Further debugging reveals that sigusr1_handler() gets called
> > repeatedly, to start autovacuum workers, and it keeps waking up and
> > starting them.  But that doesn't cause the background workers to get
> > started either, because although sigusr1_handler() contains a call to
> > maybe_start_bgworker, it only does that if start_bgworker = true,
> > which only happens if
> > CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE) is true,
> > which on these calls it isn't.
> > And I think this might also be the missing ingredient answering
> > Andres's question from before: why doesn't the 60-second
> > select()-timeout cause the background worker to eventually start even
> > if the SELECT doesn't get interrupted?  There seems to be a SIGUSR1
> > arriving about every 3 seconds, and I bet that's resetting the timer
> > every time.


Ick. There's a couple reports about HPUX behaving that way for select(),
so that seems like a reasonable theory. And a good explanation why only
anole was seing this.

> For now I propose to address this by committing the attached patch,
> which gets rid of the separate start_bgworker flag inside
> sigusr1_handler() and instead uses the same StartWorkerNeeded ||
> HaveCrashedWorker test that we use inside ServerLoop() to decide
> whether to call maybe_start_bgworker().

That seems sane.

> Either more signals will
> arrive (in which case the signal handler will launch an additional
> background worker every time a signal arrives) or they won't (in which
> case the 60-second timeout will eventually expire, and ServerLoop()
> will kick into high gear and satisfy all outstanding requests).  This
> isn't really right, because there might still be a quite noticeable
> delay waiting for workers to get launched, but at least the delay
> would be bounded to at most 60 seconds rather than, as at present,
> potentially infinite.

Right, it will somewhat fix the current issue. But aren't the
consequences of a potentially never/very seldomly run ServerLoop() more
severe? There's a couple more things done in that loop :(

> A totally correct fix will require a bit more thought.  A search of
> the git history reveals that the problem of a signal restarting the
> timeout is not new: Tom fixed a similar problem back in 2007 by making
> the autovacuum launcher sleep for at most a second at a time.  Such a
> fix isn't ideal here, because we really don't want an up-to-1-second
> delay launching a newly-registered background worker if there's a way
> to avoid that -- it's probably OK for launching daemons, but it's not
> so hot for parallel query.  However, we could:
> 
> (1) Use the self-pipe trick. We could not directly use a latch, at
> least not without a new API, because we might be waiting on more than
> one socket.

I think this is the way to go. And I think we should make the latch API
support that case, instead of reinventing it. That'll also have the
advantage of automatically using poll() (which is more efficient),
rather than select(), when supported by the platform.

> (2) Have the postmaster not set SA_RESTART for the sigusr1 handler.  I
> don't know how platform-independent this approach would be.

This makes me wary. Postmaster doesn't do too much stuff, but it'd still
be annoying to have to make sure everything always does loop around
EINTR.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)

Reply via email to