On 03/04/2018 03:40 AM, Andres Freund wrote:
>
>
> On March 3, 2018 6:36:51 PM PST, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>> On 03/04/2018 03:20 AM, Thomas Munro wrote:
>>> Hi,
>>>
>>> I saw a one-off failure like this:
>>>
>>>                                 QUERY PLAN
>>> --------------------------------------------------------------------------
>>>   Aggregate (actual rows=1 loops=1)
>>> !   ->  Nested Loop (actual rows=98000 loops=1)
>>>           ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                 Filter: (thousand = 0)
>>>                 Rows Removed by Filter: 9990
>>> !         ->  Gather (actual rows=9800 loops=10)
>>>                 Workers Planned: 4
>>>                 Workers Launched: 4
>>>                 ->  Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
>>> --- 485,495 ----
>>>                                 QUERY PLAN
>>> --------------------------------------------------------------------------
>>>   Aggregate (actual rows=1 loops=1)
>>> !   ->  Nested Loop (actual rows=97984 loops=1)
>>>           ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                 Filter: (thousand = 0)
>>>                 Rows Removed by Filter: 9990
>>> !         ->  Gather (actual rows=9798 loops=10)
>>>                 Workers Planned: 4
>>>                 Workers Launched: 4
>>>                 ->  Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
>>>
>>>
>>> Two tuples apparently went missing.
>>>
>>> Similar failures on the build farm:
>>>
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-03-03%2020%3A15%3A01
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-03-03%2018%3A13%3A32
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-03-03%2017%3A55%3A11
>>>
>>> Could this be related to commit
>>> 34db06ef9a1d7f36391c64293bf1e0ce44a33915 or commit
>>> 497171d3e2aaeea3b30d710b4e368645ad07ae43?
>>>
>>
>> I think the same failure (or at least very similar plan diff) was
>> already mentioned here:
>>
>> https://www.postgresql.org/message-id/17385.1520018...@sss.pgh.pa.us
>>
>> So I guess someone else already noticed, but I don't see the cause
>> identified in that thread.
>
> Robert and I started discussing it a bit over IM. No conclusion. Robert tried
> to reproduce locally, including disabling atomics, without luck.
>
> Can anybody reproduce locally?
>
I've started "make check" with parallel_schedule tweaked to contain many
select_parallel runs, and so far I've seen a couple of failures like
this (about 10 failures out of 1500 runs):

  select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and
tenk2.thousand=0;
! ERROR:  lost connection to parallel worker

I have no idea why the worker fails (no segfaults in dmesg, nothing in
the postgres log), or if it's related to the issue discussed here at all.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services