On Fri, Aug 18, 2017 at 3:50 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I wrote:
>> Ah-hah, I see my dromedary box is one of the ones failing, so I'll
>> have a look there.  I can't reproduce it on my other machines.
> OK, so this is a whole lot more broken than I thought :-(.
> Bear in mind that the plan for this (omitting uninteresting detail) is
>  Nested Loop Left Join
>    ->  Values Scan on "*VALUES*"
>    ->  Finalize GroupAggregate
>          ->  Gather Merge
>                ->  Partial GroupAggregate
>                      ->  Sort
>                            ->  Parallel Seq Scan on tenk1
> What seems to be happening is that:
> 1. On the first pass, the parallel seqscan work gets doled out to several
> workers, plus the leader, as expected.
> 2. When the nestloop rescans its right input, ExecReScanGatherMerge
> supposes that this is good enough to handle rescanning its subnodes:
>         ExecReScan(node->ps.lefttree);
> Leaving aside the question of why that doesn't look like nearly every
> other child rescan call, what happens is that that invokes ExecReScanAgg,
> which does the more usual thing:
>         if (outerPlan->chgParam == NULL)
>                 ExecReScan(outerPlan);
> and so that invokes ExecReScanSort, and then behold what ExecReScanSort
> thinks it can optimize away:
>      * If subnode is to be rescanned then we forget previous sort results; we
>      * have to re-read the subplan and re-sort.  Also must re-sort if the
>      * bounded-sort parameters changed or we didn't select randomAccess.
>      *
>      * Otherwise we can just rewind and rescan the sorted output.
> So we never get to ExecReScanSeqScan, and thus not to heap_rescan,
> with the effect that parallel_scan->phs_nallocated never gets reset.
> 3. On the next pass, we fire up all the workers as expected, but they all
> perceive phs_nallocated >= rs_nblocks and conclude they have nothing to
> do.  Meanwhile, in the leader, nodeSort just re-emits the sorted data it
> had last time.  Net effect is that the GatherMerge node returns only the
> fraction of the data that was scanned by the leader in the first pass.
> 4. The fact that the test succeeds on many machines implies that the
> leader process is usually doing *all* of the work.  This is in itself not
> very good.  Even on the machines where it fails, the fact that the tuple
> counts are usually a pretty large fraction of the expected values
> indicates that the leader usually did most of the work.  We need to take
> a closer look at why that is.
> However, the bottom line here is that parallel scan is completely broken
> for rescans, and it's not (solely) the fault of nodeGatherMerge; rather,
> the issue is that nobody bothered to wire up parallelism to the rescan
> parameterization mechanism.

I think we don't generate parallel plans for parameterized paths, so I
am not sure whether any work is required in that area.

>  I imagine that related bugs can be
> demonstrated in 9.6 with little effort.
> I think that the correct fix probably involves marking each parallel scan
> plan node as dependent on a pseudo executor parameter, which the parent
> Gather or GatherMerge node would flag as being changed on each rescan.
> This would cue the plan layers in between that they cannot optimize on the
> assumption that the leader's instance of the parallel scan will produce
> exactly the same rows as it did last time, even when "nothing else
> changed".  The "wtParam" pseudo parameter that's used for communication
> between RecursiveUnion and its descendant WorkTableScan node is a good
> model for what needs to happen.

Yeah, that seems like a good idea.  Another way could be to *not*
optimize rescanning when we are in parallel mode (IsInParallelMode()).
That might be more restrictive than what you are suggesting, but it
would be somewhat simpler.
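
To make the failure and the simpler alternative concrete, here is a toy
model in plain C.  It is only a sketch, not PostgreSQL code: every toy_*
and Toy* name is made up for illustration, the in_parallel_mode flag
merely stands in for a check like IsInParallelMode(), and one block is
assumed to yield exactly one row.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the shared parallel-scan state (cf. phs_nallocated). */
typedef struct ToyParallelScan
{
    int nblocks;     /* total blocks in the relation */
    int nallocated;  /* blocks handed out so far, shared by all participants */
} ToyParallelScan;

/* Toy stand-in for the leader's Sort node above the parallel scan. */
typedef struct ToySort
{
    ToyParallelScan *scan;
    bool sort_done;   /* true once the leader has sorted its input */
    int  cached_rows; /* rows the leader sorted on the previous pass */
} ToySort;

/* Reset the shared allocator, as a real scan rescan would. */
static void
toy_scan_rescan(ToyParallelScan *scan)
{
    scan->nallocated = 0;
}

/* Claim up to 'want' blocks; returns how many were actually claimed. */
static int
toy_scan_claim(ToyParallelScan *scan, int want)
{
    int left = scan->nblocks - scan->nallocated;
    int got = (want < left) ? want : left;

    assert(want >= 0);
    scan->nallocated += got;
    return got;
}

/*
 * Rescan the sort.  With in_parallel_mode == false this mimics the
 * optimization quoted above: keep the sorted output and never reach
 * the scan.  With in_parallel_mode == true it always forgets the old
 * result and rescans the child (the conservative alternative).
 */
static void
toy_sort_rescan(ToySort *sort, bool in_parallel_mode)
{
    if (!sort->sort_done)
        return;
    if (in_parallel_mode)
    {
        sort->sort_done = false;
        sort->cached_rows = 0;
        toy_scan_rescan(sort->scan);
    }
}

/*
 * Run one pass: the leader sorts up to 'leader_share' blocks unless it
 * still has cached output, then the workers claim whatever is left.
 * Returns the total number of rows the Gather Merge would see.
 */
static int
toy_run_pass(ToySort *sort, int leader_share)
{
    int worker_rows;

    if (!sort->sort_done)
    {
        sort->cached_rows = toy_scan_claim(sort->scan, leader_share);
        sort->sort_done = true;
    }
    worker_rows = toy_scan_claim(sort->scan, sort->scan->nblocks);
    return sort->cached_rows + worker_rows;
}
```

With a 10-block scan and a leader share of 4, the first toy_run_pass()
returns 10.  After toy_sort_rescan(&sort, false) the next pass returns
only 4, which is the leader's cached share, because the workers find
nothing left to claim.  After toy_sort_rescan(&sort, true) the pass
returns 10 again, at the cost of always re-reading and re-sorting.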

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
