Thanks Tom for the crash analysis, I think this is the same reason that
Rajkumar's reported case[1] was crashing only with partition-wise-join = on.
I tried to fix this crash[2] but missed the place where I have added assert
check in the assumption that we always coming from the previous check in the
while loop.

Instead, assert check we need a similar bailout logic[2] before looping back to
first partial plan. Attached patch does the same, I've tested with
parallel_leader_participation = off setting as suggested by Andres, make check
looks good except there is some obvious regression diff.

1] 
https://postgr.es/m/CAKcux6m+6nTO6C08kKaj-Waffvgvp-9SD3RCGStX=mzk0gq...@mail.gmail.com
2] 
https://postgr.es/m/CAAJ_b975k58H+Ed4=p0vbJunwO2reOMX5CVB8_R=JmXxY3uW=q...@mail.gmail.com

Regards,
Amul

On Wed, Dec 6, 2017 at 9:31 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I wrote:
>> Robert Haas <rh...@postgresql.org> writes:
>>> Support Parallel Append plan nodes.
>
>> Buildfarm member sidewinder doesn't like this:
>
> It transpires that my machine prairiedog also shows the failure.
> I instrumented nodeAppend.c (as per attached patch) to see what
> was going on, and I get this trace on prairiedog:
>
> 2017-12-05 22:50:30.228 EST [28812] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan -1, fpp 0, nplans 6
> 2017-12-05 22:50:30.228 EST [28812] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a1;
> 2017-12-05 22:50:30.241 EST [28813] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan -1, fpp 0, nplans 6
> 2017-12-05 22:50:30.241 EST [28813] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a1;
> 2017-12-05 22:50:30.252 EST [28814] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan -1, fpp 0, nplans 6
> 2017-12-05 22:50:30.252 EST [28814] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a1;
> 2017-12-05 22:50:30.363 EST [28816] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan 0, fpp 2, nplans 6
> 2017-12-05 22:50:30.363 EST [28816] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a2;
> 2017-12-05 22:50:30.363 EST [28816] LOG:  choose_next_subplan_for_worker: 
> picking nextplan 0, fpp 2, nplans 6
> 2017-12-05 22:50:30.363 EST [28816] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a2;
> 2017-12-05 22:50:30.363 EST [28816] LOG:  choose_next_subplan_for_worker: 
> whichplan 0, nextplan 1, fpp 2, nplans 6
> 2017-12-05 22:50:30.363 EST [28816] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a2;
> 2017-12-05 22:50:30.364 EST [28816] LOG:  choose_next_subplan_for_worker: 
> picking nextplan 1, fpp 2, nplans 6
> 2017-12-05 22:50:30.364 EST [28816] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a2;
> 2017-12-05 22:50:30.364 EST [28816] LOG:  choose_next_subplan_for_worker: 
> whichplan 1, nextplan 2, fpp 2, nplans 6
> 2017-12-05 22:50:30.364 EST [28816] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a2;
> 2017-12-05 22:50:30.364 EST [28816] LOG:  choose_next_subplan_for_worker: 
> picking nextplan 2, fpp 2, nplans 6
> 2017-12-05 22:50:30.364 EST [28816] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a2;
> 2017-12-05 22:50:30.364 EST [28816] LOG:  choose_next_subplan_for_worker: 
> whichplan 2, nextplan 3, fpp 2, nplans 6
> 2017-12-05 22:50:30.364 EST [28816] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a2;
> 2017-12-05 22:50:30.364 EST [28816] LOG:  choose_next_subplan_for_worker: 
> picking nextplan 3, fpp 2, nplans 6
> 2017-12-05 22:50:30.364 EST [28816] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a2;
> 2017-12-05 22:50:30.365 EST [28816] LOG:  choose_next_subplan_for_worker: 
> whichplan 3, nextplan 4, fpp 2, nplans 6
> 2017-12-05 22:50:30.365 EST [28816] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a2;
> 2017-12-05 22:50:30.365 EST [28816] LOG:  choose_next_subplan_for_worker: 
> picking nextplan 4, fpp 2, nplans 6
> 2017-12-05 22:50:30.365 EST [28816] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a2;
> 2017-12-05 22:50:30.365 EST [28816] LOG:  choose_next_subplan_for_worker: 
> whichplan 4, nextplan 5, fpp 2, nplans 6
> 2017-12-05 22:50:30.365 EST [28816] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a2;
> 2017-12-05 22:50:30.365 EST [28816] LOG:  choose_next_subplan_for_worker: 
> picking nextplan 5, fpp 2, nplans 6
> 2017-12-05 22:50:30.365 EST [28816] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a2;
> 2017-12-05 22:50:30.365 EST [28816] LOG:  choose_next_subplan_for_worker: 
> whichplan 5, nextplan 2, fpp 2, nplans 6
> 2017-12-05 22:50:30.365 EST [28816] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a2;
> 2017-12-05 22:50:30.407 EST [28815] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan -1, fpp 2, nplans 6
> 2017-12-05 22:50:30.407 EST [28815] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a2;
> 2017-12-05 22:50:30.474 EST [28817] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan -1, fpp 2, nplans 6
> 2017-12-05 22:50:30.474 EST [28817] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a2;
> 2017-12-05 22:50:30.568 EST [28818] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan 0, fpp 6, nplans 6
> 2017-12-05 22:50:30.568 EST [28818] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a3;
> 2017-12-05 22:50:30.568 EST [28818] LOG:  choose_next_subplan_for_worker: 
> picking nextplan 0, fpp 6, nplans 6
> 2017-12-05 22:50:30.568 EST [28818] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a3;
> 2017-12-05 22:50:30.569 EST [28818] LOG:  choose_next_subplan_for_worker: 
> whichplan 0, nextplan 1, fpp 6, nplans 6
> 2017-12-05 22:50:30.569 EST [28818] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a3;
> 2017-12-05 22:50:30.569 EST [28818] LOG:  choose_next_subplan_for_worker: 
> picking nextplan 1, fpp 6, nplans 6
> 2017-12-05 22:50:30.569 EST [28818] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a3;
> 2017-12-05 22:50:30.569 EST [28818] LOG:  choose_next_subplan_for_worker: 
> whichplan 1, nextplan 2, fpp 6, nplans 6
> 2017-12-05 22:50:30.569 EST [28818] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a3;
> 2017-12-05 22:50:30.569 EST [28818] LOG:  choose_next_subplan_for_worker: 
> picking nextplan 2, fpp 6, nplans 6
> 2017-12-05 22:50:30.569 EST [28818] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a3;
> 2017-12-05 22:50:30.570 EST [28818] LOG:  choose_next_subplan_for_worker: 
> whichplan 2, nextplan 3, fpp 6, nplans 6
> 2017-12-05 22:50:30.570 EST [28818] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a3;
> 2017-12-05 22:50:30.570 EST [28818] LOG:  choose_next_subplan_for_worker: 
> picking nextplan 3, fpp 6, nplans 6
> 2017-12-05 22:50:30.570 EST [28818] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a3;
> 2017-12-05 22:50:30.570 EST [28818] LOG:  choose_next_subplan_for_worker: 
> whichplan 3, nextplan 4, fpp 6, nplans 6
> 2017-12-05 22:50:30.570 EST [28818] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a3;
> 2017-12-05 22:50:30.570 EST [28818] LOG:  choose_next_subplan_for_worker: 
> picking nextplan 4, fpp 6, nplans 6
> 2017-12-05 22:50:30.570 EST [28818] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a3;
> 2017-12-05 22:50:30.570 EST [28818] LOG:  choose_next_subplan_for_worker: 
> whichplan 4, nextplan 5, fpp 6, nplans 6
> 2017-12-05 22:50:30.570 EST [28818] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a3;
> 2017-12-05 22:50:30.570 EST [28818] LOG:  choose_next_subplan_for_worker: 
> picking nextplan 5, fpp 6, nplans 6
> 2017-12-05 22:50:30.570 EST [28818] STATEMENT:  select round(avg(aa)), 
> sum(aa) from a_star a3;
> TRAP: FailedAssertion("!(append->first_partial_plan < node->as_nplans)", 
> File: "nodeAppend.c", Line: 520)
> 2017-12-05 22:50:38.211 EST [28158] LOG:  background worker "parallel worker" 
> (PID 28818) was terminated by signal 6: Abort trap
> 2017-12-05 22:50:38.211 EST [28158] DETAIL:  Failed process was running: 
> select round(avg(aa)), sum(aa) from a_star a3;
>
> That makes it pretty clear what's going wrong, but why don't we see it
> elsewhere?
>
> On my development machine, the same patch yields
>
> 2017-12-05 22:42:24.187 EST [843] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan -1, fpp 0, nplans 6
> 2017-12-05 22:42:24.187 EST [843] STATEMENT:  select round(avg(aa)), sum(aa) 
> from a_star a1;
> 2017-12-05 22:42:24.188 EST [844] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan -1, fpp 0, nplans 6
> 2017-12-05 22:42:24.188 EST [844] STATEMENT:  select round(avg(aa)), sum(aa) 
> from a_star a1;
> 2017-12-05 22:42:24.188 EST [845] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan -1, fpp 0, nplans 6
> 2017-12-05 22:42:24.188 EST [845] STATEMENT:  select round(avg(aa)), sum(aa) 
> from a_star a1;
> 2017-12-05 22:42:24.197 EST [846] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan -1, fpp 2, nplans 6
> 2017-12-05 22:42:24.197 EST [846] STATEMENT:  select round(avg(aa)), sum(aa) 
> from a_star a2;
> 2017-12-05 22:42:24.200 EST [847] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan -1, fpp 2, nplans 6
> 2017-12-05 22:42:24.200 EST [847] STATEMENT:  select round(avg(aa)), sum(aa) 
> from a_star a2;
> 2017-12-05 22:42:24.200 EST [848] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan -1, fpp 2, nplans 6
> 2017-12-05 22:42:24.200 EST [848] STATEMENT:  select round(avg(aa)), sum(aa) 
> from a_star a2;
> 2017-12-05 22:42:24.214 EST [849] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan -1, fpp 6, nplans 6
> 2017-12-05 22:42:24.214 EST [849] STATEMENT:  select round(avg(aa)), sum(aa) 
> from a_star a3;
> 2017-12-05 22:42:24.214 EST [852] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan -1, fpp 6, nplans 6
> 2017-12-05 22:42:24.214 EST [852] STATEMENT:  select round(avg(aa)), sum(aa) 
> from a_star a3;
> 2017-12-05 22:42:24.214 EST [853] LOG:  choose_next_subplan_for_worker: 
> whichplan -1, nextplan -1, fpp 6, nplans 6
> 2017-12-05 22:42:24.214 EST [853] STATEMENT:  select round(avg(aa)), sum(aa) 
> from a_star a3;
>
> I conclude that the reason we don't see the crash except on the slowest
> buildfarm critters is that on most machines the leader process manages
> to finish off all the subplans before any of the workers get as far as
> doing something.  This doesn't give me a warm fuzzy feeling about how
> much testing this patch has gotten.
>
>                         regards, tom lane
>

Attachment: Parallel-Append-crash-fix.patch
Description: Binary data

Reply via email to