On Thu, Nov 19, 2015 at 2:59 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Nov 19, 2015 at 12:27 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>
>> On Wed, Nov 18, 2015 at 7:25 AM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>> > Don't we need the startup cost in case we need to build partial paths
>> > for joinpaths like mergepath?
>> > Also, I think there are other cases for a single relation scan where
>> > startup cost can matter, like when there are pseudoconstants in the
>> > qualification (refer cost_qual_eval_walker()) or, let us say, if
>> > someone has disabled seq scan (disable_cost is considered as startup
>> > cost).
>>
>> I'm not saying that we don't need to compute it.  I'm saying we don't
>> need to take it into consideration when deciding which paths have
>> merit.  Note that consider_startup is set this way:
>>
>> rel->consider_startup = (root->tuple_fraction > 0);
>>
>
> Even when consider_startup is false, startup_cost is still used for cost
> calculation.  Now, maybe ignoring that is okay for partial paths, but it
> still seems worth thinking about why ignoring it is okay for partial
> paths even though it is used in add_path().
That is explained in the comments, and I just explained it again in my
previous email.  I'm not sure how much clearer I can make it.  For a
regular path, it might sometimes be useful to pick a path with a higher
total cost if it has a lower startup cost.  The reason this could be
useful is that we might not run the resulting plan to completion.
However, parallel queries are always run to completion, so a lower
startup cost isn't useful.  We just want the path with the lowest total
cost.  I don't know what else to say here unless you can ask a more
specific question.

> Won't it be useful to consider parameterized paths for the below kind of
> plan, where we can push the join tree to the worker and each worker can
> scan the complete outer relation A and then the rest of the work is
> divided among workers (of course there can be other ways to parallelize
> such joins, but the way described also seems to be possible)?
>
> NestLoop
>   -> Seq Scan on A
>   -> Hash Join
>        Join Condition: B.Y = C.W
>        -> Seq Scan on B
>        -> Index Scan using C_Z_IDX on C
>             Index Condition: C.Z = A.X

I had thought that this sort of plan wouldn't actually occur in real
life, but it seems that it does.  What you've written here is a little
muddled - the hash join has no Hash node underneath, for example, and
there'd have to be some sort of join order restriction in order to
consider a plan of this type.
However, somewhat to my surprise, I was able to get a plan much like
this by doing this:

rhaas=# create table a (x int);
CREATE TABLE
rhaas=# insert into a values (1);
INSERT 0 1
rhaas=# create table b (y int, filler text);
CREATE TABLE
rhaas=# insert into b select g, random()::text||random()::text||random()::text||random()::text from generate_series(1,1000000) g;
INSERT 0 1000000
rhaas=# create table c (z int, w int, filler text);
CREATE TABLE
rhaas=# insert into c select g, g, random()::text||random()::text||random()::text||random()::text from generate_series(1,1000000) g;
INSERT 0 1000000
rhaas=# create index c_z_idx on c (z);
CREATE INDEX
rhaas=# vacuum analyze;
VACUUM
rhaas=# explain analyze select * from A LEFT JOIN (B INNER JOIN C ON B.Y = C.W) ON C.Z = A.x;
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=8.46..26810.48 rows=1 width=152) (actual time=0.076..166.946 rows=1 loops=1)
   ->  Seq Scan on a  (cost=0.00..1.01 rows=1 width=4) (actual time=0.015..0.016 rows=1 loops=1)
   ->  Hash Join  (cost=8.46..26809.47 rows=1 width=148) (actual time=0.057..166.925 rows=1 loops=1)
         Hash Cond: (b.y = c.w)
         ->  Seq Scan on b  (cost=0.00..23051.00 rows=1000000 width=72) (actual time=0.012..89.013 rows=1000000 loops=1)
         ->  Hash  (cost=8.44..8.44 rows=1 width=76) (actual time=0.035..0.035 rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Index Scan using c_z_idx on c  (cost=0.42..8.44 rows=1 width=76) (actual time=0.031..0.032 rows=1 loops=1)
                     Index Cond: (z = a.x)
 Planning time: 0.394 ms
 Execution time: 167.015 ms
(11 rows)

This is extremely narrow.  If you have more than one row in A, the
planner doesn't pick a nested loop.
And if you're actually going to be running queries like this frequently,
then you should create an index on B (Y), which causes you to get an
execution time of ~0.2 ms instead of 167 ms, because we generate a
parameterized nestloop over two index scans.  But if you have no index
on B (Y) and A does contain precisely one row, then you can get this
sort of plan, and hey, the runtime is even long enough for parallelism
to potentially be useful.

But after thinking about it for a while, I realized that even if you
think it's important to cater to that case, you still can't just switch
the sequential scan on B out for a parallel sequential scan and stick a
Gather node on top.  It will not work.  The problem is that, although we
would probably only pick this plan if A contains <= 1 row, it might turn
out at execution time that a second row has been inserted since
statistics were gathered.  Then, we'd need to rescan B.  But it isn't
necessarily the case that every worker would finish the first scan of B
at exactly the same time.  The first worker to finish scanning B would
punch the rescan button, and now chaos ensues, because all the other
workers now start seeing - for the same row in A - rows from B that have
already been processed.  Oops.

To make it safe, you'd need some kind of synchronization that would
guarantee that nobody tries to rescan B until everybody has finished the
initial scan.  We do not have such a mechanism today, so this kind of
plan is simply unsafe.

There's another problem, too.  Suppose there were also a volatile filter
condition on A.  It's possible that scans in two different workers could
arrive at two different conclusions about which tuple from A to perform
the B/C join for first.  So now we've got each of them receiving part of
the B/C rows and joining them against two different A tuples, which is
nonsense.  We could handle this problem by not generating this sort of
plan when there are any volatile filter conditions on A, but that sounds
pretty fragile.
We could also handle it by sticking the Gather node on top of the Hash
Join instead of at the top of the plan tree, which sounds a lot safer
but would require executor infrastructure we don't have - specifically,
the parameter-passing stuff.

So, all in all, I think this isn't a very promising type of plan - both
because we haven't got the infrastructure to make it safe to execute
today, and because even if we did have that infrastructure, it wouldn't
be the right choice except in narrow circumstances.  We can of course
revise that decision in the future if things look different then.

> Is the main reason to have add_partial_path() that it has fewer
> checks, or is it that the current add_path() would give wrong answers
> in some case?

The main reason is that it adds things to the partial_pathlist rather
than the pathlist, but the fact that it has fewer checks is a very nice
bonus.  add_path() is performance-critical, and I'd rather not
complicate it further with more if statements, especially when a much,
much simpler version will do for partial paths.

> typo - 'incompable'

OK, I can fix that.

>> > A.
>> > This means that for inheritance child relations for which rel pages
>> > are less than parallel_threshold, it will always consider the cost
>> > shared between 1 worker and the leader, as per the below calculation
>> > in cost_seqscan:
>> >
>> >     if (path->parallel_degree > 0)
>> >         run_cost = run_cost / (path->parallel_degree + 0.5);
>> >
>> > I think this might not be the appropriate cost model even for
>> > non-inheritance relations which have pages more than
>> > parallel_threshold, but it seems to be even worse for inheritance
>> > children which have pages less than parallel_threshold.
>>
>> Why?
>
> Because I think the way the code is written, it assumes that for each
> inheritance-child relation which has pages less than the threshold,
> half the work will be done by the master backend, which doesn't seem
> to be the right distribution.
> Consider a case where there are three such children, each having cost
> 100 to scan; now it will cost them as 100/1.5 + 100/1.5 + 100/1.5,
> which means that per worker, it is considering 0.5 of the master
> backend's work, which seems to be wrong.
>
> I think for the Append case, we should consider this cost during
> Append path creation in create_append_path().  Basically, we can make
> cost_seqscan ignore the cost reduction due to parallel_degree for
> inheritance relations, and then during Append path creation we can
> consider it and also consider the work unit of the master backend as
> 0.5 with respect to the overall work.

No, I don't think that's right.  It's true that the way we're
calculating parallel_degree for each relation is unprincipled right now,
and we need to improve that.  But if it were correct, then what we're
doing here would also be correct.  If the number of workers chosen for
each child plan reflected the maximum number that could be used
effectively by that child plan, then any extras wouldn't speed things up
even if they were present, so the Append's cost calculation would be
right.

> --- a/src/backend/optimizer/README
> +++ b/src/backend/optimizer/README
> +plan as possible.  Expanding the range of cases in which more work can be
> +pushed below the Gather (and costly them accurately) is likely to keep us
> +busy for a long time to come.
>
> Seems there is a typo in the above text.
> /costly/cost

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers