Re: [HACKERS] Parallel Append implementation

Amit Khandekar Sun, 15 Jan 2017 20:20:28 -0800

Thanks Ashutosh for the feedback.

On 6 January 2017 at 17:04, Ashutosh Bapat
<[email protected]> wrote:
> On Fri, Dec 23, 2016 at 10:51 AM, Amit Khandekar <[email protected]> 
> wrote:
>> Currently an Append plan node does not execute its subplans in
>> parallel. There is no distribution of workers across its subplans. The
>> second subplan starts running only after the first subplan finishes,
>> although the individual subplans may be running parallel scans.
>>
>> Secondly, we create a partial Append path for an appendrel, but we do
>> that only if all of its member subpaths are partial paths. If one or
>> more of the subplans is a non-parallel path, there will be only a
>> non-parallel Append. So whatever node is sitting on top of Append is
>> not going to do a parallel plan; for example, a select count(*) won't
>> divide it into partial aggregates if the underlying Append is not
>> partial.
>>
>> The attached patch removes both of the above restrictions.  There has
>> already been a mail thread [1] that discusses an approach suggested by
>> Robert Haas for implementing this feature. This patch uses this same
>> approach.
>
> The first goal requires some kind of synchronization which will allow workers
> to be distributed across the subplans. The second goal requires some kind of
> synchronization to prevent multiple workers from executing non-parallel
> subplans. The patch uses different mechanisms to achieve the goals. If
> we create two different patches addressing each goal, those may be
> easier to handle.


Goal A : Allow non-partial subpaths in Partial Append.
Goal B : Distribute workers across the Append subplans.
Both of these require some kind of synchronization while choosing the
next subplan. So, goal B is achieved by doing all the synchronization
stuff. And implementation of goal A requires that goal B is
implemented. So there is a dependency between these two goals. While
implementing goal B, we should keep in mind that it should also work
for goal A; it does not make sense later changing the synchronization
logic in goal A.

I am ok with splitting the patch into 2 patches :
a) changes required for goal A
b) changes required for goal B.
But I think we should split it only when we are ready to commit them
(commit for B, immediately followed by commit for A). Until then, we
should consider both of these together because they are interconnected
as explained above.

>
> We may want to think about a third goal: preventing too many workers
> from executing the same plan. As per comment in get_parallel_divisor()
> we do not see any benefit if more than 4 workers execute the same
> node. So, an append node can distribute more than 4 worker nodes
> equally among the available subplans. It might be better to do that as
> a separate patch.

I think that comment is for calculating leader contribution. It does
not say that 4 workers is too many workers in general.

But yes, I agree, and I have it in mind as the next improvement.
Basically, it does not make sense to give more than 3 workers to a
subplan when parallel_workers for that subplan are 3. For e.g., if
gather max workers is 10, and we have 2 Append subplans s1 and s2 with
parallel workers 3 and 5 respectively. Then, with the current patch,
it will distribute 4 workers to each of these workers. What we should
do is : once both of the subplans get 3 workers each, we should give
the 7th and 8th worker to s2.

Now that I think of that, I think for implementing above, we need to
keep track of per-subplan max_workers in the Append path; and with
that, the bitmap will be redundant. Instead, it can be replaced with
max_workers. Let me check if it is easy to do that. We don't want to
have the bitmap if we are sure it would be replaced by some other data
structure.


> Here are some review comments
I will handle the other comments, but first, just a quick response to
some important ones :

> 6. By looking at parallel_worker field of a path, we can say whether it's
> partial or not. We probably do not require to maintain a bitmap for that at in
> the Append path. The bitmap can be constructed, if required, at the time of
> creating the partial append plan. The reason to take this small step is 1. we
> want to minimize our work at the time of creating paths, 2. while freeing a
> path in add_path, we don't free the internal structures, in this case the
> Bitmap, which will waste memory if the path is not chosen while planning.

Let me try keeping the per-subplan max_worker info in Append path
itself, like I mentioned above. If that works, the bitmap will be
replaced by max_worker field. In case of non-partial subpath,
max_worker will be 1. (this is the same info kept in AppendState node
in the patch, but now we might need to keep it in Append path node as
well).

>
> 7. If we consider 6, we don't need concat_append_subpaths(), but still here 
> are
> some comments about that function. Instead of accepting two separate arguments
> childpaths and child_partial_subpaths_set, which need to be in sync, we can
> just pass the path which contains both of those. In the same following code 
> may
> be optimized by adding a utility function to Bitmapset, which advances
> all members
> by given offset and using that function with bms_union() to merge the
> bitmapset e.g.
> bms_union(*partial_subpaths_set,
> bms_advance_members(bms_copy(child_partial_subpaths_set), 
> append_subpath_len));
>     if (partial_subpaths_set)
>     {
>         for (i = 0; i < list_length(childpaths); i++)
>         {
>             /*
>              * The child paths themselves may or may not be partial paths. The
>              * bitmapset numbers of these paths will need to be set 
> considering
>              * that these are to be appended onto the partial_subpaths_set.
>              */
>             if (!child_partial_subpaths_set ||
>                 bms_is_member(i, child_partial_subpaths_set))
>             {
>                 *partial_subpaths_set = bms_add_member(*partial_subpaths_set,
>                                                        append_subpath_len + 
> i);
>             }
>         }
>     }

Again, for the reason mentioned above, we will defer this point for now.

>
> 8.
> -            parallel_workers = Max(parallel_workers, path->parallel_workers);
> +            /*
> +             * partial_subpaths can have non-partial subpaths so
> +             * path->parallel_workers can be 0. For such paths, allocate one
> +             * worker.
> +             */
> +            parallel_workers +=
> +                (path->parallel_workers > 0 ? path->parallel_workers : 1);
>
> This looks odd. Earlier code was choosing maximum of all parallel workers,
> whereas new code adds them all. E.g. if parallel_workers for subpaths is 3, 4,
> 3, without your change, it will pick up 4. But with your change it will pick
> 10. I think, you intend to write this as
> parallel_workers = Max(parallel_workers, path->parallel_workers ?
> path->parallel_workers : 1);
The intention is to add all workers, because a parallel-aware Append
is going to need them in order to make the subplans run with their
full capacity in parallel. So with subpaths with 3, 4, and 3 workers,
the Append path will need 10 workers. If it allocates 4 workers, its
not sufficient : Each of them would get only 1 worker, or max 2. In
the existing code, 4 is correct, because all the workers are going to
execute the same subplan node at a time.


>
> 9. Shouldn't this funciton return double?
> int
> get_parallel_divisor(int parallel_workers)
Yes, right, I will do that.

>
> 9. In get_parallel_divisor(), if parallel_worker is 0 i.e. it's a partial path
> the return value will be 2, which isn't true. This function is being called 
> for
> all the subpaths to get the original number of rows and costs of partial 
> paths.
> I think we don't need to call this function on subpaths which are not partial
> paths or make it work parallel_workers = 0.
I didn't understand this. I checked again get_parallel_divisor()
function code. I think it will return 1 if parallel_workers is 0. But
I may be missing something.


> 12. cost_append() essentially adds costs of all the subpaths and then divides
> by parallel_divisor. This might work if all the subpaths are partial paths. 
> But
> for the subpaths which are not partial, a single worker will incur the whole
> cost of that subpath. Hence just dividing all the total cost doesn't seem the
> right thing to do. We should apply different logic for costing non-partial
> subpaths and partial subpaths.

WIth the current partial path costing infrastructure, it is assumed
that a partial path node should return the average per-worker cost.
Hence, I thought it would be best to do it in a similar way for
Append. But let me think if we can do something. With the current
parallelism costing infrastructure, I am not sure though.

>
> 13. No braces required for single line block
> +    /* Increment worker count for the chosen node, if at all we found one. */
> +    if (min_whichplan != PA_INVALID_PLAN)
> +    {
> +        padesc->pa_info[min_whichplan].pa_num_workers++;
> +    }
>
> 14. exec_append_scan_first() is a one-liner function, should we just inline 
> it?
>
> 15. This patch replaces exec_append_initialize_next() with
> exec_append_scan_first(). The earlier function was handling backward and
> forward scans separately, but the later function doesn't do that. Why?

I will come to these and some other ones later.

>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parallel Append implementation

Reply via email to