On Wed, Nov 2, 2016 at 10:54 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>>> I got confused by that a minute ago, so no you're not alone. The problem
>>>> is even worse in join cases. For example:
>>>>
>>>>  Gather  (cost=34332.00..53265.35 rows=100 width=8)
>>>>    Workers Planned: 2
>>>>    ->  Hash Join  (cost=33332.00..52255.35 rows=100 width=8)
>>>>          Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
>>>>          ->  Append  (cost=0.00..8614.96 rows=417996 width=8)
>>>>                ->  Parallel Seq Scan on pp  (cost=0.00..8591.67 rows=416667 width=8)
>>>>                ->  Parallel Seq Scan on pp1  (cost=0.00..23.29 rows=1329 width=8)
>>>>          ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
>>>>                ->  Seq Scan on cc  (cost=0.00..14425.00 rows=1000000 width=8)
>>
>> Although - it is estimating 1M rows, but only "per worker" estimates are
>> shown, and because there are 2 workers planned it says 1M/2.4 which is
>> the 416k. I agree it's a bit unclear, but at least it's consistent with
>> how we treat loops (i.e. that the numbers are per loop).
>
> Well, it's not *that* consistent. If we were estimating all the numbers
> underneath the Gather as being per-worker numbers, that would make some
> amount of sense. But neither the other seqscan, nor the hash on it, nor
> the hashjoin's output count are scaled that way. It's very hard to call
> the above display anything but flat-out broken.
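(For reference, the rows=416667 on the Parallel Seq Scan is just the table's
1,000,000-row estimate divided by the 2.4 divisor mentioned in the quote:
two planned workers plus an estimated 0.4 leader contribution. A minimal
standalone sketch of that arithmetic follows; the helper name is
illustrative rather than PostgreSQL's actual cost_seqscan code, though the
0.3-per-worker leader discount is what reproduces the quoted 2.4.)

    #include <stdio.h>

    /* Per-worker row estimate: total rows divided by the planned worker
     * count plus the leader's estimated (and diminishing) contribution. */
    static double
    per_worker_rows(double total_rows, int parallel_workers)
    {
        double parallel_divisor = parallel_workers;
        double leader_contribution = 1.0 - (0.3 * parallel_workers);

        if (leader_contribution > 0)
            parallel_divisor += leader_contribution;

        return total_rows / parallel_divisor;
    }

    int
    main(void)
    {
        /* 2 workers => divisor 2.4, so 1,000,000 rows displays as 416667,
         * matching the Parallel Seq Scan on pp in the plan above. */
        printf("%.0f\n", per_worker_rows(1000000.0, 2));
        return 0;
    }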
While investigating why Rushabh Lathia's Gather Merge patch sometimes fails
to pick a Gather Merge plan even when it really ought to do so, I ran smack
into this problem. I discovered that this is more than a cosmetic issue:
the costing itself is actually badly broken. In the single-table case,
when you have just

    Gather
      ->  Parallel Seq Scan

the Parallel Seq Scan node reflects a per-worker row estimate and the
Gather node reflects a total row estimate. But in the join case, as shown
above, the Gather thinks that the total number of rows it will produce is
equal to the number that will be produced by one single worker, which is
crap, and the cost of doing the join in parallel is based on the per-worker
rather than the total number, which is crappier. The difference in cost
between the Gather and the underlying join in the above example is exactly
1010: 1000 for parallel_setup_cost, plus 100 tuples at 0.1 per tuple, even
though 100 is the number of tuples per worker, not the total. That's
really not good.

I probably should have realized this when I looked at this thread the first
time, but I somehow got it into my head that this was just a complaint
about the imperfections of the display (which is indeed imperfect) and
failed to realize that the same report was also pointing to an actual
costing bug. I apologize for that.

The reason this is happening is that final_cost_nestloop(),
final_cost_hashjoin(), and final_cost_mergejoin() don't care a whit about
whether the path they are generating is partial. They apply the row
estimate for the joinrel itself to every such path generated for the join,
except for parameterized paths, which are a special case. I think this
generally has the effect of discouraging parallel joins, because the
inflated row count also inflates the join cost. I think the right thing to
do is probably to scale the row count estimate for the joinrel's partial
paths by the leader_contribution value computed in cost_seqscan.

Despite my general hatred of back-patching things that cause plan changes,
I'm inclined to think the fix for this should be back-patched to 9.6,
because this is really a brown-paper-bag bug. If the consensus is
otherwise, I will of course defer to that consensus.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
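(A small standalone sketch of the arithmetic described above, assuming the
default settings parallel_setup_cost = 1000 and parallel_tuple_cost = 0.1;
the 2.4 divisor is the two-workers-plus-leader figure from the quoted plan,
and the variable names are illustrative, not PostgreSQL's own.)

    #include <stdio.h>

    int
    main(void)
    {
        double parallel_setup_cost = 1000.0;   /* default GUC value */
        double parallel_tuple_cost = 0.1;      /* default GUC value */
        double join_rows = 100.0;              /* estimate on the Hash Join */
        double parallel_divisor = 2.4;         /* 2 workers + 0.4 leader share */

        /* Gather cost minus join cost in the plan above:
         * 1000 + 100 * 0.1 = 1010, i.e. 53265.35 - 52255.35. */
        printf("gather overhead: %.2f\n",
               parallel_setup_cost + parallel_tuple_cost * join_rows);

        /* The proposed direction of the fix: give the joinrel's partial
         * paths a per-worker row estimate by dividing by the same divisor
         * the Parallel Seq Scan used, i.e. 100 / 2.4 ~= 42. */
        printf("per-worker join rows: %.0f\n", join_rows / parallel_divisor);
        return 0;
    }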