Over on [1], Tim reported that the planner is making some bad choices when the plan contains a WindowFunc which requires reading all of, or a large portion of, the WindowAgg's subnode in order to produce the first WindowAgg row.
For example:

EXPLAIN (ANALYZE, TIMING OFF)
SELECT COUNT(*) OVER ()
FROM tenk1 t1
INNER JOIN tenk1 t2 ON t1.unique1 = t2.tenthous
LIMIT 1;

With master, we get the following plan:

                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.29..0.67 rows=1 width=8) (actual time=47.491..47.492 rows=1 loops=1)
   ->  WindowAgg  (cost=0.29..3815.00 rows=10000 width=8) (actual time=47.489..47.490 rows=1 loops=1)
         ->  Nested Loop  (cost=0.29..3690.00 rows=10000 width=0) (actual time=0.026..42.972 rows=10000 loops=1)
               ->  Seq Scan on tenk1 t2  (cost=0.00..445.00 rows=10000 width=4) (actual time=0.009..1.734 rows=10000 loops=1)
               ->  Index Only Scan using tenk1_unique1 on tenk1 t1  (cost=0.29..0.31 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=10000)
                     Index Cond: (unique1 = t2.tenthous)
                     Heap Fetches: 0
 Planning Time: 0.420 ms
 Execution Time: 48.107 ms

You can see that the time to get the first WindowAgg row (47.489 ms) is not well aligned to the startup cost (0.29).  This effectively causes the planner to choose a Nested Loop plan, as it thinks it'll read just 1 row from the join.  Due to the OVER (), we'll read all rows!  Not good.  It's not hard to imagine that a slightly different schema could yield a *far* worse plan if it opted to use a non-parameterised nested loop plan and proceeded to read all rows from it.

With the attached patch, that turns into:

                                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=928.02..928.02 rows=1 width=8) (actual time=29.308..29.310 rows=1 loops=1)
   ->  WindowAgg  (cost=928.02..928.07 rows=10000 width=8) (actual time=29.306..29.308 rows=1 loops=1)
         ->  Hash Join  (cost=395.57..803.07 rows=10000 width=0) (actual time=10.674..22.032 rows=10000 loops=1)
               Hash Cond: (t1.unique1 = t2.tenthous)
               ->  Index Only Scan using tenk1_unique1 on tenk1 t1  (cost=0.29..270.29 rows=10000 width=4) (actual time=0.036..4.961 rows=10000 loops=1)
                     Heap Fetches: 0
               ->  Hash  (cost=270.29..270.29 rows=10000 width=4) (actual time=10.581..10.582 rows=10000 loops=1)
                     Buckets: 16384  Batches: 1  Memory Usage: 480kB
                     ->  Index Only Scan using tenk1_thous_tenthous on tenk1 t2  (cost=0.29..270.29 rows=10000 width=4) (actual time=0.055..5.437 rows=10000 loops=1)
                           Heap Fetches: 0
 Planning Time: 2.415 ms
 Execution Time: 30.554 ms

I'm not sure if we should consider backpatching a fix for this bug.  We tend not to commit stuff that would destabilise plans in the back branches.  On the other hand, it's fairly hard to imagine how we could make this much worse even given bad estimates.

I do think we should fix this in v16, however.  I'll add this to the "Older bugs affecting stable branches" section of the PG 16 open items list.

David

[1] https://postgr.es/m/17862-1ab8f74b0f7b0...@postgresql.org
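To make the costing idea above concrete, here is a rough, standalone sketch of the general principle (all names are made up for illustration; this is not the code in the attached patch): when the WindowFunc must read its entire input before it can emit its first row, the WindowAgg's startup cost needs to be based on the subpath's total cost rather than its startup cost.

/*
 * Hypothetical, self-contained sketch.  Not PostgreSQL code; the names and
 * numbers are only for illustration of the costing principle.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct PathCosts
{
    double startup_cost;    /* cost before the first row can be returned */
    double total_cost;      /* cost to return all rows */
} PathCosts;

/*
 * Derive WindowAgg costs from the input (subpath) costs.  "needs_all_input"
 * would be true for a window function evaluated over the whole partition
 * with an unbounded frame, such as COUNT(*) OVER ().
 */
static PathCosts
window_agg_costs(PathCosts input, double per_row_cost, double nrows,
                 bool needs_all_input)
{
    PathCosts result;

    /*
     * If producing the first output row requires reading the whole input,
     * the startup cost must include the input's total cost; otherwise
     * inheriting the input's startup cost is fine.
     */
    result.startup_cost = needs_all_input ? input.total_cost
                                          : input.startup_cost;
    result.total_cost = input.total_cost + per_row_cost * nrows;

    return result;
}

int
main(void)
{
    /* roughly the Nested Loop input costs from the first plan above */
    PathCosts nestloop = {0.29, 3690.00};
    PathCosts old_way = window_agg_costs(nestloop, 0.0125, 10000, false);
    PathCosts new_way = window_agg_costs(nestloop, 0.0125, 10000, true);

    printf("old: startup=%.2f total=%.2f\n", old_way.startup_cost, old_way.total_cost);
    printf("new: startup=%.2f total=%.2f\n", new_way.startup_cost, new_way.total_cost);
    return 0;
}

Plugging in the Nested Loop costs from the first plan, the old approach yields a startup cost of 0.29 while the adjusted approach yields 3690.00, which is enough for the LIMIT to favour the hash join plan shown above.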
fix_bug_17862.patch