On Fri, Aug 18, 2017 at 3:50 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I wrote:
>> Ah-hah, I see my dromedary box is one of the ones failing, so I'll
>> have a look there. I can't reproduce it on my other machines.
> OK, so this is a whole lot more broken than I thought :-(.
> Bear in mind that the plan for this (omitting uninteresting detail) is
> Nested Loop Left Join
>   ->  Values Scan on "*VALUES*"
>   ->  Finalize GroupAggregate
>         ->  Gather Merge
>               ->  Partial GroupAggregate
>                     ->  Sort
>                           ->  Parallel Seq Scan on tenk1
> What seems to be happening is that:
> 1. On the first pass, the parallel seqscan work gets doled out to several
> workers, plus the leader, as expected.
> 2. When the nestloop rescans its right input, ExecReScanGatherMerge
> supposes that this is good enough to handle rescanning its subnodes:
> Leaving aside the question of why that doesn't look like nearly every
> other child rescan call, what happens is that that invokes ExecReScanAgg,
> which does the more usual thing:
> if (outerPlan->chgParam == NULL)
> and so that invokes ExecReScanSort, and then behold what ExecReScanSort
> thinks it can optimize away:
> * If subnode is to be rescanned then we forget previous sort results; we
> * have to re-read the subplan and re-sort. Also must re-sort if the
> * bounded-sort parameters changed or we didn't select randomAccess.
> * Otherwise we can just rewind and rescan the sorted output.
> So we never get to ExecReScanSeqScan, and thus not to heap_rescan,
> with the effect that parallel_scan->phs_nallocated never gets reset.
> 3. On the next pass, we fire up all the workers as expected, but they all
> perceive phs_nallocated >= rs_nblocks and conclude they have nothing to
> do. Meanwhile, in the leader, nodeSort just re-emits the sorted data it
> had last time. Net effect is that the GatherMerge node returns only the
> fraction of the data that was scanned by the leader in the first pass.
> 4. The fact that the test succeeds on many machines implies that the
> leader process is usually doing *all* of the work. This is in itself not
> very good. Even on the machines where it fails, the fact that the tuple
> counts are usually a pretty large fraction of the expected values
> indicates that the leader usually did most of the work. We need to take
> a closer look at why that is.
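To illustrate what points 2 and 3 describe: each participant in the
parallel scan claims blocks from a shared counter, and if that counter is
never reset between passes, the second pass finds nothing left to claim.
Here is a simplified, standalone sketch (the names are modeled loosely on
phs_nallocated/rs_nblocks, but none of this is the actual PostgreSQL code):

    #include <stdio.h>

    /* Toy stand-in for the shared parallel-scan state. */
    typedef struct FakeParallelScan
    {
        int     phs_nallocated;     /* next block to hand out (shared) */
        int     rs_nblocks;         /* total blocks in the relation */
    } FakeParallelScan;

    /* Each participant calls this to claim the next block; -1 means "done". */
    static int
    claim_next_block(FakeParallelScan *pscan)
    {
        if (pscan->phs_nallocated >= pscan->rs_nblocks)
            return -1;
        return pscan->phs_nallocated++;
    }

    int
    main(void)
    {
        FakeParallelScan pscan = {0, 4};
        int     blk;

        /* First pass: blocks get doled out normally. */
        while ((blk = claim_next_block(&pscan)) >= 0)
            printf("pass 1: scanned block %d\n", blk);

        /*
         * "Rescan" without resetting phs_nallocated (the bug): every
         * participant immediately sees phs_nallocated >= rs_nblocks and
         * concludes there is nothing to do.
         */
        while ((blk = claim_next_block(&pscan)) >= 0)
            printf("pass 2: scanned block %d\n", blk);
        printf("pass 2: scanned nothing\n");

        return 0;
    }

Run as-is, the first loop prints four blocks and the second prints none,
which is what each worker sees on the second pass of the failing test.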
> However, the bottom line here is that parallel scan is completely broken
> for rescans, and it's not (solely) the fault of nodeGatherMerge; rather,
> the issue is that nobody bothered to wire up parallelism to the rescan
> parameterization mechanism.
I think we don't generate parallel plans for parameterized paths, so I
am not sure whether any work is required in that area.
> I imagine that related bugs can be
> demonstrated in 9.6 with little effort.
> I think that the correct fix probably involves marking each parallel scan
> plan node as dependent on a pseudo executor parameter, which the parent
> Gather or GatherMerge node would flag as being changed on each rescan.
> This would cue the plan layers in between that they cannot optimize on the
> assumption that the leader's instance of the parallel scan will produce
> exactly the same rows as it did last time, even when "nothing else
> changed". The "wtParam" pseudo parameter that's used for communication
> between RecursiveUnion and its descendant WorkTableScan node is a good
> model for what needs to happen.
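To make the wtParam-style idea a bit more concrete, here is a
hypothetical, self-contained sketch (none of these types or functions are
the real executor ones) of how flagging a pseudo parameter as changed
would defeat the "nothing changed, just rewind" shortcut at every level
below the Gather Merge:

    #include <stdbool.h>
    #include <stdio.h>

    #define RESCAN_PARAM_ID 0       /* stand-in for the pseudo parameter */

    typedef struct FakeNode
    {
        const char *name;
        unsigned    chgParam;       /* bitmask of "changed" parameter IDs */
        struct FakeNode *child;
        bool        have_cached_result;
    } FakeNode;

    static void
    fake_rescan(FakeNode *node)
    {
        if (node->chgParam == 0 && node->have_cached_result)
        {
            /* Today's shortcut: nothing changed, just rewind cached output. */
            printf("%s: rewinding cached result\n", node->name);
            return;
        }

        /* A parameter we depend on changed: drop the cache, re-read the child. */
        printf("%s: discarding cached result, rescanning child\n", node->name);
        node->have_cached_result = false;
        if (node->child)
        {
            node->child->chgParam |= node->chgParam;    /* propagate downward */
            fake_rescan(node->child);
        }
        node->chgParam = 0;
    }

    int
    main(void)
    {
        FakeNode scan = {"Parallel Seq Scan", 0, NULL, true};
        FakeNode sort = {"Sort", 0, &scan, true};
        FakeNode agg  = {"Partial GroupAggregate", 0, &sort, true};

        /* Without the pseudo parameter: every level takes the shortcut. */
        fake_rescan(&agg);

        /*
         * With a wtParam-style fix: the rescan flags the pseudo parameter
         * as changed before rescanning the subtree, so no level below can
         * pretend its cached result is still valid.
         */
        agg.chgParam |= 1u << RESCAN_PARAM_ID;
        fake_rescan(&agg);
        return 0;
    }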
Yeah, that seems like a good idea. I think another way could be to *not*
optimize rescanning when we are in parallel mode (IsInParallelMode());
that might be more restrictive than what you are suggesting, but it would
be somewhat simpler.
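As a rough illustration of that simpler alternative (a hypothetical
helper, not actual PostgreSQL code), the "can the Sort just rewind its
cached output?" decision would gain one more reason to force a full
re-read:

    #include <stdbool.h>
    #include <stdio.h>

    static bool
    must_reread_subplan(bool child_params_changed,  /* outerPlan->chgParam != NULL */
                        bool bound_params_changed,  /* bounded-sort params differ */
                        bool random_access,         /* cached output is rewindable */
                        bool in_parallel_mode)      /* the proposed extra check */
    {
        return child_params_changed || bound_params_changed ||
               !random_access || in_parallel_mode;
    }

    int
    main(void)
    {
        /*
         * The failing rescan: nothing looks changed and the output is
         * rewindable, but we are in parallel mode, so the cached result
         * is not trusted and the subplan is re-read.
         */
        printf("re-read subplan? %s\n",
               must_reread_subplan(false, false, true, true) ? "yes" : "no");
        return 0;
    }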