On Fri, Feb 17, 2017 at 9:15 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapil...@gmail.com> writes:
>> On Fri, Feb 17, 2017 at 11:22 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> In short, it looks to me like ExecShutdownGatherWorkers doesn't actually
>>> wait for parallel workers to finish (as its comment suggests is
>>> necessary), so that on not-too-speedy machines the worker slots may all
>>> still be in use when the next command wants some.
>> ExecShutdownGatherWorkers() does wait for workers to exit/finish, but it
>> doesn't wait for the postmaster to free the used slots, and that is how
>> that API is supposed to work.  There is a good chance that on slow
>> machines the slots get freed up much later by the postmaster, after the
>> workers have exited.
> That seems like a seriously broken design to me, first because it can make
> for a significant delay in the slots becoming available (which is what's
> evidently causing these regression failures), and second because it's
> simply bad design to load extra responsibilities onto the postmaster.
> Especially ones that involve touching shared memory.
> I think this needs to be changed, and promptly.  Why in the world don't
> you simply have the workers clearing their slots when they exit?

There are several reasons why cleanup at background worker exit is
handled by the postmaster: workers may need to be restarted after a
crash, someone may terminate them (TerminateBackgroundWorker()), or a
notification may have to be sent at exit (bgw_notify_pid).  Moreover,
there can be background workers without shared memory access, which
cannot clear their own state from shared memory.  Another point to
consider is that background workers run user-supplied code, so I am
not sure how safe it is to let them tinker with the backend's shared
data structures.  Now, one can argue that in the cases where it is
possible, a background worker should clean up its shared memory state
at exit, but I think that is not directly related to parallel query.
As far as parallel query is concerned, it only needs to wait for the
workers to exit, to ensure that no more operations can be performed in
them after that point, so that it can accumulate stats and retrieve
some fixed parallel state information.

> We don't have an expectation that regular backends are incompetent to
> clean up after themselves.  (Obviously, a crash exit is a different
> case.)
>> I think what we need to do
>> here is to move the test that needs workers to execute before other
>> parallel query tests where there is no such requirement.
> That's not fixing the problem, it's merely averting your eyes from
> the symptom.

I am not sure why you think so.  Parallel query is capable of running
even when the planned number of workers is not available, and there
are many reasons why that can happen.  In general, I think the tests
should not rely on the availability of background workers, and if a
test does depend on them, it should be the test's responsibility to
ensure they are available.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)