On Thu, Nov 19, 2015 at 2:59 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Nov 19, 2015 at 12:27 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>
>> On Wed, Nov 18, 2015 at 7:25 AM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>> > Don't we need the startup cost in case we need to build partial paths
>> > for joinpaths like mergepath?
>> > Also, I think there are other cases for a single relation scan where
>> > startup cost can matter, like when there are pseudoconstants in the
>> > qualification (refer cost_qual_eval_walker()) or, let us say, if
>> > someone has disabled seq scan (disable_cost is considered as startup
>> > cost).
>>
>> I'm not saying that we don't need to compute it.  I'm saying we don't
>> need to take it into consideration when deciding which paths have
>> merit.  Note that consider_startup is set this way:
>>
>> rel->consider_startup = (root->tuple_fraction > 0);
>>
>
> Even when consider_startup is false, startup_cost is still used for cost
> calculation.  Now, maybe ignoring that is okay for partial paths, but it
> still seems worth thinking about why ignoring it is okay for partial
> paths even though it is used in add_path().
That is explained in the comments, and I just explained it again in my
previous email.  I'm not sure how much clearer I can make it.  For a
regular path, it might sometimes be useful to pick a path with a higher
total cost if it has a lower startup cost.  The reason this could be
useful is that we might not run the resulting plan to completion.
However, parallel queries are always run to completion, so a lower
startup cost isn't useful.  We just want the path with the lowest total
cost.  I don't know what else to say here unless you can ask a more
specific question.

> Won't it be useful to consider parameterized paths for the below kind of
> plan, where we can push the join tree to the worker and each worker can
> scan the complete outer relation A and then the rest of the work is
> divided among workers (of course there can be other ways to parallelize
> such joins, but the way described also seems to be possible)?
>
> NestLoop
>   -> Seq Scan on A
>   -> Hash Join
>        Join Condition: B.Y = C.W
>        -> Seq Scan on B
>        -> Index Scan using C_Z_IDX on C
>             Index Condition: C.Z = A.X

I had thought that this sort of plan wouldn't actually occur in real
life, but it seems that it does.  What you've written here is a little
muddled - the hash join has no Hash node underneath, for example, and
there'd have to be some sort of join order restriction in order to
consider a plan of this type.
However, somewhat to my surprise, I was able to get a plan much like
this by doing this:

rhaas=# create table a (x int);
CREATE TABLE
rhaas=# insert into a values (1);
INSERT 0 1
rhaas=# create table b (y int, filler text);
CREATE TABLE
rhaas=# insert into b select g, random()::text||random()::text||random()::text||random()::text from generate_series(1,1000000) g;
INSERT 0 1000000
rhaas=# create table c (z int, w int, filler text);
CREATE TABLE
rhaas=# insert into c select g, g, random()::text||random()::text||random()::text||random()::text from generate_series(1,1000000) g;
INSERT 0 1000000
rhaas=# create index c_z_idx on c (z);
CREATE INDEX
rhaas=# vacuum analyze;
VACUUM
rhaas=# explain analyze select * from A LEFT JOIN (B INNER JOIN C ON B.Y = C.W) ON C.Z = A.x;
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=8.46..26810.48 rows=1 width=152) (actual time=0.076..166.946 rows=1 loops=1)
   ->  Seq Scan on a  (cost=0.00..1.01 rows=1 width=4) (actual time=0.015..0.016 rows=1 loops=1)
   ->  Hash Join  (cost=8.46..26809.47 rows=1 width=148) (actual time=0.057..166.925 rows=1 loops=1)
         Hash Cond: (b.y = c.w)
         ->  Seq Scan on b  (cost=0.00..23051.00 rows=1000000 width=72) (actual time=0.012..89.013 rows=1000000 loops=1)
         ->  Hash  (cost=8.44..8.44 rows=1 width=76) (actual time=0.035..0.035 rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Index Scan using c_z_idx on c  (cost=0.42..8.44 rows=1 width=76) (actual time=0.031..0.032 rows=1 loops=1)
                     Index Cond: (z = a.x)
 Planning time: 0.394 ms
 Execution time: 167.015 ms
(11 rows)

This is extremely narrow.  If you have more than one row in A, the
planner doesn't pick a nested loop.
And if you're actually going to be running queries like this frequently,
then you should create an index on B (Y), which causes you to get an
execution time of ~0.2 ms instead of 167 ms, because we generate a
parameterized nestloop over two index scans.  But if you have no index
on B (Y) and A does contain precisely one row, then you can get this
sort of plan, and hey, the runtime is even long enough for parallelism
to potentially be useful.

But after thinking about it for a while, I realized that even if you
think it's important to cater to that case, you still can't just switch
the sequential scan on B out for a parallel sequential scan and stick a
Gather node on top.  It will not work.  The problem is that, although we
would probably only pick this plan if A contains <= 1 row, it might turn
out at execution time that a second row has been inserted since
statistics were gathered.  Then, we'd need to rescan B.  But it isn't
necessarily the case that every worker would finish the first scan of B
at exactly the same time.  The first worker to finish scanning B would
punch the rescan button, and now chaos ensues, because all the other
workers now start seeing - for the same row in A - rows from B that have
already been processed.  Oops.

To make it safe, you'd need some kind of synchronization that would
guarantee that nobody tries to rescan B until everybody has finished the
initial scan.  We do not have such a mechanism today, so this kind of
plan is simply unsafe.

There's another problem, too.  Suppose there were also a volatile filter
condition on A.  It's possible that scans in two different workers could
arrive at two different conclusions about which tuple from A to perform
the B/C join for first.  So now we've got each of them receiving part of
the B/C rows and joining them against two different A tuples, which is
nonsense.  We could handle this problem by not generating this sort of
plan when there are any volatile filter conditions on A, but that sounds
pretty fragile.
We could also handle it by sticking the Gather node on top of the Hash
Join instead of at the top of the plan tree, which sounds a lot safer
but would require executor infrastructure we don't have - specifically,
the parameter-passing stuff.

So, all in all, I think this isn't a very promising type of plan - both
because we haven't got the infrastructure to make it safe to execute
today, and because even if we did have that infrastructure, it wouldn't
be the right choice except in narrow circumstances.  We can of course
revise that decision in the future if things look different then.

> Is the main reason to have add_partial_path() that it has fewer
> checks, or is it that the current add_path() would give wrong answers
> in some case?

The main reason is that it adds things to the partial_pathlist rather
than the pathlist, but the fact that it has fewer checks is a very nice
bonus.  add_path() is performance-critical, and I'd rather not
complicate it further with more if statements, especially when a much,
much simpler version will do for partial paths.

> typo - 'incompable'

OK, I can fix that.

>> > A.
>> > This means that for inheritance child relations for which rel pages
>> > are less than parallel_threshold, it will always consider the cost
>> > shared between 1 worker and the leader, as per the below calculation
>> > in cost_seqscan:
>> >
>> >     if (path->parallel_degree > 0)
>> >         run_cost = run_cost / (path->parallel_degree + 0.5);
>> >
>> > I think this might not be the appropriate cost model even for
>> > non-inheritance relations which have pages more than
>> > parallel_threshold, but it seems to be even worse for inheritance
>> > children which have pages less than parallel_threshold.
>>
>> Why?
>
> Because I think the way the code is written, it assumes that for each
> inheritance-child relation which has pages less than the threshold,
> half the work will be done by the master backend, which doesn't seem
> to be the right distribution.
> Consider a case where there are three such children, each having cost
> 100 to scan; now it will cost them as 100/1.5 + 100/1.5 + 100/1.5,
> which means that per worker, it is considering 0.5 of the master
> backend's work, which seems to be wrong.
>
> I think for the Append case, we should consider this cost during
> Append path creation in create_append_path().  Basically, we can make
> cost_seqscan ignore the cost reduction due to parallel_degree for
> inheritance relations, and then during Append path creation we can
> consider it and also consider the work unit of the master backend as
> 0.5 with respect to the overall work.

No, I don't think that's right.  It's true that the way we're
calculating parallel_degree for each relation is unprincipled right now,
and we need to improve that.  But if it were correct, then what we're
doing here would also be correct.  If the number of workers chosen for
each child plan reflected the maximum number that could be used
effectively by that child plan, then any extras wouldn't speed things up
even if they were present, so the Append's cost calculation would be
right.

> --- a/src/backend/optimizer/README
> +++ b/src/backend/optimizer/README
> +plan as possible.  Expanding the range of cases in which more work can be
> +pushed below the Gather (and costly them accurately) is likely to keep us
> +busy for a long time to come.
>
> Seems there is a typo in the above text.
> /costly/cost

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers