On 2025-10-30 16:52, Robert Haas wrote:
Have you localized the problem to cpu_tuple_cost specifically, vs.
cpu_index_tuple_cost or cpu_operator_cost? I've generally found that I
need to reduce random_page_cost and seq_page_cost significantly to
avoid getting sequential scans when index scans would be more
reasonable, but that goes in the opposite direction as what you
suggest here, in that it brings the I/O and CPU costs closer together,
whereas your suggestion would push them further apart. I remember that
Kevin Grittner used to say that the default value of this parameter
was bad, too, but he recommended *raising* it:
In these particular cases I looked at, cpu_tuple_cost is the main
issue, yes. I am not saying that cpu_tuple_cost is the real culprit in
general. I tune random_page_cost and seq_page_cost frequently, but you
rarely need to worry about the different cpu_* parameters. I remember
one database where I lowered it back to the default, but that was a few
years ago. I am no expert on cpu_tuple_cost in general.
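For what it's worth, the kind of experiment I mean looks roughly like
this; the 0.005 below is a made-up illustration rather than a
recommendation, and it is only set for the current session:

-- Inspect the current CPU cost parameters.
SHOW cpu_tuple_cost;
SHOW cpu_index_tuple_cost;
SHOW cpu_operator_cost;

-- Try an alternative value and see whether the plan changes.
SET cpu_tuple_cost = 0.005;
EXPLAIN SELECT count(*) FROM pg_class;  -- substitute the query under study
RESET cpu_tuple_cost;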
On 2025-10-30 20:57, Robert Haas wrote:
On Thu, Oct 30, 2025 at 11:52 AM Robert Haas <[email protected]> wrote:
Right. Although that's the main thing here, I am inclined to suspect
there are other ways to hit this problem, maybe ways that are more
likely to happen in the real world, because...
And just like that, I found another way that this can happen. Consider
this query from the regression tests:
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a
AND t1.a = t2.b ORDER BY t1.a, t2.b;
Each of prt1 and prt2 has three partitions. Since t1.a = t2.a and
t1.a = t2.b, the planner deduces that t2.a = t2.b. The only rows from
t2 where a = b are in the first partition. The planner therefore
estimates that if it just does a Merge Join between the two
partitioned tables, the Merge Join will stop early. There are actually
nine rows in t2 where a = b, but the planner's estimate is just three
rows, so it's understandable that it will very quickly run out of rows
on the t2 side of the join. Thus, in the non-partitionwise plan, while
its estimated cost to scan t1 is 44.83 and its estimated cost to scan
t2 is 5.55, the estimated cost of the join is only 7.99, reflecting
the fact that it doesn't anticipate having to actually finish the scan
of t1.
Now, if it does a partitionwise join, it still picks a Merge Join for
the prt1_p1/prt2_p1 sub-join, and that can still stop early. But for
the prt1_p2/prt2_p2 and prt1_p3/prt2_p3 joins, it picks hash joins,
which as far as the planner knows can't stop early, so there's no
opportunity to get a "discount" on the cost of scanning any of those
tables. As a result, the estimated cost of this plan ends up being
11.53, clearly more than the non-partitionwise estimated cost.
In this case, the planner's methodology doesn't really make a lot of
sense when you stop to think about it. If the planner is correct that
the non-partitionwise join will stop early, then the hash joins that
are chosen in the partitionwise scan will stop early, because the
non-parallel hashjoin code notices when the hash table built for the
inner side is empty and bails out without finishing the outer scan.
But the planner code is understandably reluctant to bet on a table
being completely empty for costing purposes. Performing a
non-partitionwise join allows the planner to make an end run around
this restriction: it can bet on the combined inner table ending up
with no rows contributed by the second or third child tables without
actually betting on any relation being completely empty.
Consequently, it doesn't seem like this type of case can account for
the original report of a massive real-world run time regression. The
planner's mental gymnastics here cause it to think that a
non-partitionwise plan will be faster, but as far as I can tell,
there's no real reason to expect that it actually will be.
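The bail-out described above, where the hash join gives up once the
inner side turns out to be empty, is easy to see with a throwaway
example; a sketch (the temp tables here are made up):

-- Force a hash join against an empty inner side.
CREATE TEMP TABLE big_outer AS
    SELECT g AS id FROM generate_series(1, 100000) g;
CREATE TEMP TABLE empty_inner (id int);
ANALYZE big_outer;
ANALYZE empty_inner;

SET enable_mergejoin = off;
SET enable_nestloop = off;
EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM big_outer o JOIN empty_inner i USING (id);
RESET enable_mergejoin;
RESET enable_nestloop;

In the ANALYZE output the scan of big_outer should report at most a
single row fetched rather than all 100000, because the join quits as
soon as the hash table is found to be empty.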
100%. I think such cases are common. I once had a dirty hack that
generated several equivalent variants of a SQL query (most of them not
even partitioning related), planned them all, and executed the one with
the lowest estimated cost. I ran it on my OLAP benchmark back then and
noticed that the average overall execution time (not counting the
gazillion times of additional planning) suffered from this. If we give
the planner a lot of options, it has more chances to make bad
estimates.
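From memory, the shape of that hack was roughly the function below.
This is a freehand reconstruction rather than the original code (the
run_cheapest name and everything about it is made up); it reads the
planner's total cost estimate out of EXPLAIN (FORMAT JSON):

CREATE OR REPLACE FUNCTION run_cheapest(VARIADIC variants text[])
RETURNS SETOF record
LANGUAGE plpgsql AS
$$
DECLARE
    v         text;
    plan_json text;
    cost      float8;
    best      text;
    best_cost float8;
BEGIN
    -- Plan every variant and remember the one with the lowest total cost.
    FOREACH v IN ARRAY variants LOOP
        EXECUTE 'EXPLAIN (FORMAT JSON) ' || v INTO plan_json;
        cost := (plan_json::json -> 0 -> 'Plan' ->> 'Total Cost')::float8;
        IF best IS NULL OR cost < best_cost THEN
            best := v;
            best_cost := cost;
        END IF;
    END LOOP;
    -- Execute only the variant the planner liked best.
    RETURN QUERY EXECUTE best;
END;
$$;

The caller has to pass equivalent query texts plus a column definition
list, and every variant gets planned first, which is exactly the extra
planning overhead I mentioned above.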
Quite often the estimates for provably cardinality-equivalent sub-paths
are vastly different. It would be immensely helpful if we could reason
about that a bit better, but I have no idea how to do that.
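For reference, the two estimates in Robert's example are easy to
compare against the regression database by toggling
enable_partitionwise_join; a sketch (the exact cost numbers will depend
on the build and the statistics):

SET enable_partitionwise_join = off;
EXPLAIN SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2
    WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;

SET enable_partitionwise_join = on;
EXPLAIN SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2
    WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;

RESET enable_partitionwise_join;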
Regards
Arne