On Mon, 29 Sept 2025 at 17:55, Lauro Ojeda <[email protected]> wrote:
> Hi David,
>
> Thank you for your nice reply.
>
> I have the impression there is something heavily penalizing the usage of
> partition pruning. While trying to go a bit further, I realized that
> partition pruning is not occurring because the planner puts the cost of
> seq-scanning all partitions at just over 2M units, but the query takes over
> 3 minutes. When forcing the usage of the PK index via hinting (which down the
> line forces the partition pruning to happen), the cost sky-rockets to 53M
> units (while scanning only 4 partitions out of 12). Even with the much
> higher cost, the query launches fewer workers and completes in a third (or
> less) of the time of scanning all partitions.
>
> My observation is that the cost of seq-scanning each partition is
> estimated at ~124k units, whereas index-scanning each partition is ~440k
> (say, 3x higher). Therefore, when scanning 4 partitions, the total cost
> should be ~1.7M units, which is below the 2M units for scanning all
> partitions, so the planner should prefer the indexed path instead. However,
> when analyzing the hinted query (doing an index scan and thus partition
> pruning), we can note that the cost estimated for each partition scan was
> added to the final cost, even though 8 out of 12 partitions were not
> scanned (never executed).
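The back-of-envelope comparison above can be written out explicitly. The per-partition figures are rounded from the estimates quoted in this thread, so this is only a sketch of the arithmetic, not output from the actual cluster:

```sql
-- ~124k units per seq-scanned partition, ~441k units per index-scanned
-- partition (rounded from the plans in this thread):
SELECT 12 * 124000 AS seq_scan_all_partitions,   -- 1488000 units
       4  * 441000 AS index_scan_after_pruning;  -- 1764000 units
```

Both the index-scan total (~1.76M) and the per-partition seq-scan sum sit below the ~2M the planner charges for the non-pruned plan, which is the core of the argument: if the eight pruned partitions cost nothing at run time, the index path should win.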
>
> postgres=> explain analyze
> /*+ IndexScan(accounts accounts_pkey) */
> select aid, abalance
> from accounts
> where transaction_date in (select dt_col from t2);
>
>                                                                 QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=1.56..53359847.63 rows=2767123 width=8) (actual time=2.299..111346.649 rows=1664742 loops=1)
>    ->  HashAggregate  (cost=1.12..1.23 rows=10 width=4) (actual time=1.060..40.562 rows=6 loops=1)
>          Group Key: t2.dt_col
>          Batches: 1  Memory Usage: 24kB
>          ->  Seq Scan on t2  (cost=0.00..1.10 rows=10 width=4) (actual time=1.048..1.050 rows=6 loops=1)
>    ->  Append  (cost=0.43..5302777.61 rows=3320703 width=12) (actual time=1.554..18514.181 rows=277457 loops=6)
>          ->  Index Scan using accounts_p1_pkey on accounts_p1 accounts_1  (cost=0.43..434635.82 rows=273011 width=12) (actual time=2.214..17627.113 rows=276971 loops=1)
>                Index Cond: (transaction_date = t2.dt_col)
>          ->  Index Scan using accounts_p2_pkey on accounts_p2 accounts_2  (cost=0.43..441097.44 rows=277283 width=12) (actual time=2.185..16280.020 rows=276634 loops=1)
>                Index Cond: (transaction_date = t2.dt_col)
>          ->  Index Scan using accounts_p3_pkey on accounts_p3 accounts_3  (cost=0.43..441666.94 rows=277392 width=12) (actual time=0.751..25992.220 rows=278390 loops=1)
>                Index Cond: (transaction_date = t2.dt_col)
>          ->  Index Scan using accounts_p4_pkey on accounts_p4 accounts_4  (cost=0.43..441689.70 rows=277483 width=12) (never executed)
>                Index Cond: (transaction_date = t2.dt_col)
>          ->  Index Scan using accounts_p5_pkey on accounts_p5 accounts_5  (cost=0.43..441883.26 rows=277546 width=12) (never executed)
>                Index Cond: (transaction_date = t2.dt_col)
>          ->  Index Scan using accounts_p6_pkey on accounts_p6 accounts_6  (cost=0.43..441857.26 rows=277591 width=12) (actual time=1.376..17001.777 rows=277582 loops=3)
>                Index Cond: (transaction_date = t2.dt_col)
>          ->  Index Scan using accounts_p7_pkey on accounts_p7 accounts_7  (cost=0.43..441837.57 rows=277503 width=12) (never executed)
>                Index Cond: (transaction_date = t2.dt_col)
>          ->  Index Scan using accounts_p8_pkey on accounts_p8 accounts_8  (cost=0.43..441843.95 rows=277511 width=12) (never executed)
>                Index Cond: (transaction_date = t2.dt_col)
>          ->  Index Scan using accounts_p9_pkey on accounts_p9 accounts_9  (cost=0.43..441711.03 rows=277506 width=12) (never executed)
>                Index Cond: (transaction_date = t2.dt_col)
>          ->  Index Scan using accounts_p10_pkey on accounts_p10 accounts_10  (cost=0.43..441918.96 rows=277554 width=12) (never executed)
>                Index Cond: (transaction_date = t2.dt_col)
>          ->  Index Scan using accounts_p11_pkey on accounts_p11 accounts_11  (cost=0.43..441501.86 rows=277377 width=12) (never executed)
>                Index Cond: (transaction_date = t2.dt_col)
>          ->  Index Scan using accounts_p12_pkey on accounts_p12 accounts_12  (cost=0.43..434530.31 rows=272946 width=12) (never executed)
>                Index Cond: (transaction_date = t2.dt_col)
>  Planning Time: 0.879 ms
>  Execution Time: 111432.755 ms
> (32 rows)
>
> By looking into it, I have the impression there is a bug in the costing
> sum in that situation, where the cost of the "never executed" partitions
> should be deducted from the final cost estimation, which would make pruning
> the preferred option in this case.

The plan generation stage is where the costing sum is figured out, whereas the
"never executed" marking/detection happens in the execution stage. So this
doesn't look to be a costing bug in the above scenario, where the planner was
unable to prune partitions early.

> Are my assumptions correct?
> Is there anything I could do to influence the planner to dismiss the cost
> of "never executed" scans?
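On the second question: the planner can't currently discount "never executed" subplans at costing time, but you can experiment with the cost model so that the parameterised Nested Loop (the pruning-capable plan shape) wins without hints. A session-level sketch — the value 1.1 assumes SSD-class storage and is illustrative, not a recommendation:

```sql
-- Session-level experiment: lower the random-I/O penalty so index paths
-- look comparatively cheaper. The default random_page_cost is 4.0.
SET random_page_cost = 1.1;

EXPLAIN ANALYZE
SELECT aid, abalance
FROM accounts
WHERE transaction_date IN (SELECT dt_col FROM t2);

RESET random_page_cost;
```

Whether this flips the plan depends on how close the seq-scan and index-scan path costs already are on your data; it changes the global trade-off, not just this query, so test it per session before considering it for postgresql.conf.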
>
> Kind regards,
> Lauro Ojeda
>
>
> On Thu, 25 Sept 2025 at 22:18, David Rowley <[email protected]> wrote:
>
>> On Fri, 26 Sept 2025 at 07:49, Lauro Ojeda <[email protected]> wrote:
>> > The only way I found to make pruning work is to force index_scan using
>> > pg_hint_plan, but I wanted to influence the planner to decide it by itself
>> > rather than relying on hints. What's the reason for this misbehaving and
>> > what could I do to overcome it?
>>
>> > Partition key: RANGE (transaction_date)
>>
>> > postgres=> explain analyze
>> > select aid, abalance
>> > from accounts
>> > where transaction_date in (select dt_col from t2);
>>
>> The only partition pruning that exists in PostgreSQL that can prune
>> for that query is for parameterised Nested Loop joins. For Hash Join,
>> it's been talked about, but this requires running the partition
>> pruning code for every value that goes into the hash table and only
>> scanning the unioned set of those partitions during the hash probe
>> phase. The trouble with that is that it's very hard to know in advance
>> if it'll be worth the extra effort. Putting a tuple into a hash table
>> is quite cheap. Running the pruning code for a range-partitioned table
>> is likely to be a few times more costly than the hash insert
>> (depending on how many partitions there are), so if the end result is
>> that nothing was pruned, then that's quite a bit of extra effort for
>> no gain.
>>
>> What we maybe could do better is reduce the cost of the Append scan
>> when there's a run-time pruning object attached. This is a little
>> tricky, as we currently only build that object when creating the final
>> plan. To include that in the costs we'd need to move that to the Path
>> generation phase so that we didn't accidentally reject Paths which
>> could be cheaper than we first think.
>>
>> > Also, how could I contribute to get this partition pruning to work?
>>
>> The pgsql-hackers mailing list is where all the discussions about that
>> happen.
>> There are plenty of past discussions on these topics. One such
>> (fairly) recent discussion is in [1]. There are plenty more, including
>> some ideas from Robert Haas about how we might cost run-time partition
>> pruning. That was likely around the 2017-2018 range, so you might need to
>> dig deep to find that.
>>
>> David
>>
>> [1] https://www.postgresql.org/message-id/flat/CAApHDvoC7n_oceb%3D8z%2BMY8sTgH4xa%2ByAwBxZ4Dxv8pwkT9bOcA%40mail.gmail.com#45314d3d01ef8ad1eebe72111989062c
>
>
> --
> *Lauro Ojeda*

--
Regards,
Chetan
