On Wed, Oct 29, 2025 at 9:23 PM Arne Roland <[email protected]> wrote: > The main factor of your example is, that the amount of rows handled by the > (Merge) Append is different.
Right. Although that's the main thing here, I am inclined to suspect there are other ways to hit this problem, maybe ways that are more likely to happen in the real world, because... > My second sentence just captured the mundane observation, if the join has > significantly more tuples, than any base relation, the place of the (Merge) > Append might be more relevant. If I join everything with a generate_series(1, > 30000) I get more tuples to process. ...as you imply, joins that inflate the row count are somewhat uncommon. They definitely do happen, but they're not the most typical pattern, and there might well be other reasons why a partitionwise join fails to win that we haven't figured out yet. These could even be cases where, for example, a certain optimization that works in the non-partitionwise case is not preserved in the partitionwise case. I feel like I now understand *one* case where Ashutosh's patch can make a demonstrable positive difference, but whether that's the only case that exists seems quite uncertain. > I'd like to make one more side note about this example: The planner punishes > the partitionwise join for having an extra node, that emits N rows (three > Hash joins + Append vs two Appends + Hash Join). This plan is chosen because > of the cpu_tuple_cost. I'm happy it picks the plan with the smaller memory > footprint, but in my real world experience for a timing based approach the > default cpu_tuple_cost tends to be too high to get a fair comparison between > partitionwise and non partitionwise joins. Have you localized the problem to cpu_tuple_cost specifically, vs. cpu_index_tuple_cost or cpu_operator_cost? I've generally found that I need to reduce random_page_cost and seq_page_cost significantly to avoid getting sequential scans when index scans would be more reasonable, but that goes in the opposite direction as what you suggest here, in that it brings the I/O and CPU costs closer together, whereas your suggestion would push them further apart. I remember that Kevin Grittner used to say that the default value of this parameter was bad, too, but he recommended *raising* it: https://www.postgresql.org/message-id/1385148245.49487.YahooMailNeo%40web162904.mail.bf1.yahoo.com https://www.postgresql.org/message-id/[email protected] https://www.postgresql.org/message-id/CACjxUsNp4uEx3xsunw4wVpBDVomas7o6hnv_49bSbaz-HAVdyA%40mail.gmail.com I don't actually know what's best in terms of settings in this area. I don't have experience tuning for partitionwise join specifically. -- Robert Haas EDB: http://www.enterprisedb.com
