Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

Tom Lane Wed, 14 Jun 2017 12:34:18 -0700

Yesterday lorikeet failed the select_parallel test in a new way:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-13%2020%3A28%3A33


2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR:  could not map dynamic 
shared memory segment
2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG:  worker process: parallel 
worker for PID 10072 (PID 11896) exited with exit code 1
2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG:  statement: select 
stringu1::int2 from tenk1 where unique1 = 1;
TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - 
BackgroundWorkerData->parallel_terminate_count <= 1024)", File: 
"/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c",
 Line: 974)
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG:  server process (PID 10072) 
was terminated by signal 6: Aborted
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:7] DETAIL:  Failed process was 
running: select stringu1::int2 from tenk1 where unique1 = 1;
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:8] LOG:  terminating any other 
active server processes

So the first problem here is the lack of supporting information for the
'could not map' failure.  AFAICS, this must mean either that dsm_attach()
returned without ever calling dsm_impl_op() at all, or that
dsm_impl_op()'s Windows codepath encountered ERROR_ALREADY_EXISTS or
ERROR_ACCESS_DENIED.  It's far from clear why those cases should be
treated as a silent fail.  It's even less clear why dsm_attach's early
exit cases don't produce any messages.  But since we're left not knowing
what happened, the messaging design here is clearly inadequate.

The subsequent assertion crash came from here:

    /*
     * If this is a parallel worker, check whether there are already too many
     * parallel workers; if so, don't register another one.  Our view of
     * parallel_terminate_count may be slightly stale, but that doesn't really
     * matter: we would have gotten the same result if we'd arrived here
     * slightly earlier anyway.  There's no help for it, either, since the
     * postmaster must not take locks; a memory barrier wouldn't guarantee
     * anything useful.
     */
    if (parallel && (BackgroundWorkerData->parallel_register_count -
                     BackgroundWorkerData->parallel_terminate_count) >=
        max_parallel_workers)
    {
        Assert(BackgroundWorkerData->parallel_register_count -
               BackgroundWorkerData->parallel_terminate_count <=
               MAX_PARALLEL_WORKER_LIMIT);
        LWLockRelease(BackgroundWorkerLock);
        return false;
    }

What I suspect happened is that parallel_register_count was less than
parallel_terminate_count; since those counters are declared as uint32,
the negative difference would have been treated as a large unsigned value,
triggering the assertion.  It's not very clear how that happened, but my
bet is that the postmaster incremented parallel_terminate_count more than
once while cleaning up after the crashed worker.  It looks to me like
there's nothing stopping BackgroundWorkerStateChange from incrementing it
and then the eventual ForgetBackgroundWorker call from incrementing it
again.  I haven't traced through things to identify why this might only
occur in a worker-failure scenario, but surely we want to make sure that
the counter increment happens once and only once per worker.

I'm also noting that ForgetBackgroundWorker is lacking a memory barrier
between incrementing parallel_terminate_count and marking the slot free.
Maybe it doesn't need one, but since there is one in
BackgroundWorkerStateChange, it looks weird to not have it here.

                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

Reply via email to