Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

Tomas Vondra Sat, 03 Mar 2018 19:08:42 -0800


On 03/04/2018 03:40 AM, Andres Freund wrote:
> 
> 
> On March 3, 2018 6:36:51 PM PST, Tomas Vondra <tomas.von...@2ndquadrant.com> 
> wrote:
>> On 03/04/2018 03:20 AM, Thomas Munro wrote:
>>> Hi,
>>>
>>> I saw a one-off failure like this:
>>>
>>>                                   QUERY PLAN
>>> --------------------------------------------------------------------------
>>>    Aggregate (actual rows=1 loops=1)
>>> !    ->  Nested Loop (actual rows=98000 loops=1)
>>>            ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                  Filter: (thousand = 0)
>>>                  Rows Removed by Filter: 9990
>>> !          ->  Gather (actual rows=9800 loops=10)
>>>                  Workers Planned: 4
>>>                  Workers Launched: 4
>>>                  ->  Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
>>> --- 485,495 ----
>>>                                   QUERY PLAN
>>> --------------------------------------------------------------------------
>>>    Aggregate (actual rows=1 loops=1)
>>> !    ->  Nested Loop (actual rows=97984 loops=1)
>>>            ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                  Filter: (thousand = 0)
>>>                  Rows Removed by Filter: 9990
>>> !          ->  Gather (actual rows=9798 loops=10)
>>>                  Workers Planned: 4
>>>                  Workers Launched: 4
>>>                  ->  Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
>>>
>>>
>>> Two tuples apparently went missing.
>>>
>>> Similar failures on the build farm:
>>>
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-03-03%2020%3A15%3A01
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-03-03%2018%3A13%3A32
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-03-03%2017%3A55%3A11
>>>
>>> Could this be related to commit
>>> 34db06ef9a1d7f36391c64293bf1e0ce44a33915 or commit
>>> 497171d3e2aaeea3b30d710b4e368645ad07ae43?
>>>
>>
>> I think the same failure (or at least very similar plan diff) was
>> already mentioned here:
>>
>> https://www.postgresql.org/message-id/17385.1520018...@sss.pgh.pa.us
>>
>> So I guess someone else already noticed, but I don't see the cause
>> identified in that thread.
> 
> Robert and I started discussing it a bit over IM. No conclusion. Robert tried 
> to reproduce locally, including disabling atomics, without luck.
> 
> Can anybody reproduce locally?
>


I've started "make check" with parallel_schedule tweaked to contain many
select_parallel runs, and so far I've seen occasional failures like
this (about 10 out of 1500 runs):

  select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
! ERROR:  lost connection to parallel worker

I have no idea why the worker fails (no segfaults in dmesg, nothing in
the postgres log), or if it's related to the issue discussed here at all.
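
For anyone who wants to try the same kind of repeated run, here is a
minimal sketch of how the extra schedule lines could be generated. It is
an illustration only, not the exact change used above: the function names
and the repeat count are hypothetical, and it assumes that one "test:"
line per run (so the runs execute one after another) is acceptable to
pg_regress. The tests that create tenk1 and tenk2 would still have to
appear earlier in the schedule.

```python
def build_schedule(test_name: str = "select_parallel", repeats: int = 100) -> str:
    """Return schedule text with one 'test:' line per run.

    Each run gets its own line, so the runs are executed serially
    rather than as members of one parallel group.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    return "".join(f"test: {test_name}\n" for _ in range(repeats))


def failure_rate(failures: int, runs: int) -> float:
    """Observed fraction of failed runs."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return failures / runs


if __name__ == "__main__":
    # Print a short sample; append the output to a copy of the schedule file.
    print(build_schedule(repeats=5), end="")
    # Figures reported above: about 10 failures out of 1500 runs.
    print(f"# observed failure rate: {failure_rate(10, 1500):.2%}")
```

With the numbers above the observed rate is well under 1%, so a few
hundred runs may be needed before the failure shows up even once.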

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
