On Thu, May 26, 2016 at 5:57 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Tue, May 24, 2016 at 6:36 PM, Andreas Seltenreich <seltenre...@gmx.de>
> wrote:
>> Each of the sent plans was collected when a worker dumped core due to
>> the failed assertion.  More core dumps than plans were actually
>> observed, since with this failed assertion, multiple workers usually
>> trip on and dump core simultaneously.
>> The following query corresponds to plan2:
>> --8<---------------cut here---------------start------------->8---
>> select
>>   pg_catalog.pg_stat_get_bgwriter_requested_checkpoints() as c0,
>>   subq_0.c3 as c1, subq_0.c1 as c2, 31 as c3, 18 as c4,
>>   (select unique1 from public.bprime limit 1 offset 9) as c5,
>>   subq_0.c2 as c6
>> from
>> (select ref_0.tablename as c0, ref_0.inherited as c1,
>>         ref_0.histogram_bounds as c2, 100 as c3
>>       from
>>         pg_catalog.pg_stats as ref_0
>>       where 49 is not NULL limit 55) as subq_0
>> where true
>> limit 58;
>> --8<---------------cut here---------------end--------------->8---
> I am able to reproduce the assertion (it occurs once every two to three
> runs, but always at the same place) that you reported upthread with the
> above query.  It seems to me the issue here is that while the workers are
> writing tuples into the tuple queues, the master backend has detached from
> the queues.  The reason the master backend has detached from the tuple
> queues is the Limit clause: after processing the number of tuples
> specified by the Limit clause, it calls shutdown of nodes in the below
> part of code:

I can't reproduce this assertion failure on master.  I tried running
'make installcheck' and then running this query repeatedly in the
regression database with and without
parallel_setup_cost=parallel_tuple_cost=0, and got nowhere.  Does that
work for you, or do you have some other set of steps?

> I think the workers should stop processing tuples after the tuple queues
> get detached.  This will not only handle the above situation gracefully,
> but will also allow speeding up queries where a Limit clause is present on
> top of a Gather node.  A patch for the same is attached to this mail (this
> was part of the original parallel seq scan patch, but was not applied; as
> far as I remember, the reason was that we thought such an optimization
> might not be required for the initial version).

This is very likely a good idea, but...

> Another approach to fix this issue could be to reset mqh_partial_bytes and
> mqh_length_word_complete in shm_mq_sendv in case of SHM_MQ_DETACHED.  These
> are currently reset only in case of success.

...I think we should do this too, because it's intended that calling
shm_mq_sendv again after it previously returned SHM_MQ_DETACHED should
again return SHM_MQ_DETACHED, not fail an assertion.  Can you see
whether the attached patch fixes this for you?

(Status update for Noah: I will provide another update regarding this
issue no later than Monday COB, US time.  I assume that Amit will have
responded by then, and it should hopefully be clear what the next step
is at that point.)

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: dont-fail-mq-assert-v1.patch
Description: binary/octet-stream

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)