Re: [HACKERS] Parallel Append implementation

```On 19 February 2017 at 14:59, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Feb 17, 2017 at 2:56 PM, Amit Khandekar <amitdkhan...@gmail.com>
> wrote:
>> The log2(num_children)+1 formula which you proposed does not take into
>> account the number of workers for each of the subplans, that's why I
>> am a bit more inclined to look for some other logic. May be, treat the
>> children as if they belong to partitions, and accordingly calculate
>> the final number of workers. So for 2 children with 4 and 5 workers
>> respectively, Append parallel_workers would be : log3(3^4 + 3^5) .
>
> In general this will give an answer not different by more than 1 or 2
> from my answer, and often exactly the same.  In the case you mention,
> whether we get the same answer depends on which way you round:
> log3(3^4+3^5) is 5 if you round down, 6 if you round up.
>
> My formula is more aggressive when there are many subplans that are
> not parallel or take only 1 worker, because I'll always use at least 5
> workers for an append that has 9-16 children, whereas you might use
> only 2 if you do log3(3^0+3^0+3^0+3^0+3^0+3^0+3^0+3^0+3^0).  In that
> case I like my formula better. With lots of separate children, the
> chances of being able to use as many as 5 workers seem good.  (Note
> that using 9 workers as Ashutosh seems to be proposing would be a
> waste if the different children have very unequal execution times,
> because the workers that run children with short execution times can
> be reused to run additional subplans while the long ones are still
> running.  Running a separate worker for each child only works out if
> the shortest runtime is more than 50% of the longest runtime, which
> may sometimes be true but doesn't seem like a good bet in general.)
>
> Your formula is more aggressive when you have 3 children that all use
> the same number of workers; it'll always decide on <number of workers
> per child>+1, whereas mine won't add the extra worker in that case.
> Possibly your formula is better than mine in that case, but I'm not
> sure.  If you have as many as 9 children that all want N workers, your
> formula will decide on N+2 workers, but since my formula guarantees a
> minimum of 5 workers in such cases, I'll probably be within 1 of
> whatever answer you were getting.
>```
```
Yeah, that seems to be right in most of the cases. The only cases
where your formula seems to give too few workers is for something like
: (2, 8, 8). For such subplans, we should at least allocate 8 workers.
It turns out that in most of the cases in my formula, the Append
workers allocated is just 1 worker more than the max per-subplan
worker count. So in (2, 1, 1, 8), it will be a fraction more than 8.
So in the patch, in addition to the log2() formula you proposed, I
have made sure that it allocates at least equal to max(per-subplan
parallel_workers values).

>
>> BTW, there is going to be some logic change in the choose-next-subplan
>> algorithm if we consider giving extra workers to subplans.
>
> I'm not sure that it's going to be useful to make this logic very
> complicated.  I think the most important thing is to give 1 worker to
> each plan before we give a second worker to any plan.  In general I
> think it's sufficient to assign a worker that becomes available to the
> subplan with the fewest number of workers (or one of them, if there's
> a tie) without worrying too much about the target number of workers
> for that subplan.

In the attached v5 patch, the logic of distributing the workers is now
kept simple : it just distributes the workers equally without
considering the per-sublan parallel_workers value. I have retained the
earlier logic of choosing the plan with minimum current workers. But
now that the pa_max_workers is not needed, I removed it, and instead a
partial_plans bitmapset is added in the Append node. Once a worker
picks up a non-partial subplan, it immediately changes its
pa_num_workers to -1. Whereas for partial subplans, the worker sets it
to -1 only after it finishes executing it.

Effectively, in parallel_append_next(), the check for whether subplan
is executing with max parallel_workers is now removed, and all code
that was using pa_max_workers is now removed.

Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote:
> 10. We should probably move the parallel_safe calculation out of
> cost_append().
> +            path->parallel_safe = path->parallel_safe &&
> +                                  subpath->parallel_safe;
>
> 11. This check shouldn't be part of cost_append().
> +            /* All child paths must have same parameterization */
> +            Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
>

Moved out these two statements from cost_append(). Did it separately
in create_append_path().

Also, I have removed some elog() statements which were there while
inside Spinlock in parallel_append_next().

On 17 January 2017 at 11:10, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> I was looking at the executor portion of this patch and I noticed that in
> exec_append_initialize_next():
>
>         return parallel_append_next(appendstate);
>
>     /*
>      * Not parallel-aware. Fine, just go on to the next subplan in the
>      * appropriate direction.
>      */
>     if (ScanDirectionIsForward(appendstate->ps.state->es_direction))
>         appendstate->as_whichplan++;
>     else
>         appendstate->as_whichplan--;
>
> which seems to mean that executing Append in parallel mode disregards the
> scan direction.  I am not immediately sure what implications that has, so
> I checked what heap scan does when executing in parallel mode, and found
> this in heapgettup():
>
>     else if (backward)
>     {
>         /* backward parallel scan not supported */
>         Assert(scan->rs_parallel == NULL);
>
> Perhaps, AppendState.as_padesc would not have been set if scan direction
> is backward, because parallel mode would be disabled for the whole query
> in that case (PlannerGlobal.parallelModeOK = false).  Maybe add an
> Assert() similar to one in heapgettup().
>

Right. Thanks for noticing this. I have added a similar Assert in
exec_append_initialize_next().
```

ParallelAppend_v5.patch
Description: Binary data

```--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
```