I've switched back to this thread and will begin by working through the key concerns that were previously raised.
The first concern is the lack of a proof demonstrating the correctness of this transformation. To address this, I plan to include a detailed proof in the README, along the lines of the following. ====== proof start ====== To prove that the transformation is correct, we partition the tables in the FROM clause into two groups: those that contain at least one aggregation column, and those that do not contain any aggregation columns. Each group can be treated as a single relation formed by the Cartesian product of the tables within that group. Therefore, without loss of generality, we can assume that the FROM clause contains exactly two relations, R1 and R2, where R1 represents the relation containing all aggregation columns, and R2 represents the relation without any aggregation columns. Let the query be of the form: SELECT G, AGG(A) FROM R1 JOIN R2 ON J GROUP BY G; where G is the set of grouping keys that may include columns from R1 and/or R2; AGG(A) is an aggregate function over columns A from R1; J is the join condition between R1 and R2. The transformation of eager aggregation is: GROUP BY G, AGG(A) on (R1 JOIN R2 ON J) = GROUP BY G, AGG(agg_A) on ((GROUP BY G1, AGG(A) AS agg_A on R1) JOIN R2 ON J) This equivalence holds under the following conditions: 1) AGG is decomposable, meaning that it can be computed in two stages: a partial aggregation followed by a final aggregation; 2) The set G1 used in the pre-aggregation of R1 includes: * all columns from R1 that are part of the grouping keys G, and * all columns from R1 that appear in the join condition J. 3) The grouping operator for any column in G1 must be compatible with the operator used for that column in the join condition J. Since G1 includes all columns from R1 that appear in either the grouping keys G or the join condition J, all rows within each partial group have identical values for both the grouping keys and the join-relevant columns from R1, assuming compatible operators are used. As a result, the rows within a partial group are indistinguishable in terms of their contribution to the aggregation and their behavior in the join. This ensures that all rows in the same partial group share the same "destiny": they either all match or all fail to match a given row in R2. Because the aggregate function AGG is decomposable, aggregating the partial results after the join yields the same final result as aggregating after the full join, thereby preserving query semantics. Q.E.D. The second concern is that a RelOptInfo representing a grouped relation may include paths that produce different row sets due to partial aggregation being applied at different join levels. This potentially violates a fundamental assumption in the planner. Additionally, the patch currently performs an exhaustive search by exploring partial aggregation at every possible join level, leading to excessive planning effort, which may not be justified by the cost-benefit ratio. To address these concerns, I'm thinking that maybe we can adopt a strategy where partial aggregation is only pushed to the lowest possible level in the join tree that is deemed useful. In other words, if we can build a grouped path like "AGG(B) JOIN A" -- and AGG(B) yields a significant reduction in row count -- we skip exploring alternatives like "AGG(A JOIN B)". This is somewhat analogous to how we handle qual clauses: we only push a qual clause down to the lowest scan or join level that includes all the relations it references -- following the "filter early, join late" principle. For example, if predicate Pb only references B, we only consider "A JOIN sigma[Pb](B)" and skip "sigma[Pb](A JOIN B)". (Note that if Pb involves costly functions and the join is highly selective, we may want to apply the predicate after the join.) This ensures that all grouped paths for the same grouped relation produce the same set of rows (e.g., consider "A JOIN AGG(B) JOIN C" vs. "AGG(B) JOIN C JOIN A"). As a result, we avoid the complexity of comparing costs between different grouped paths of the same grouped relation, and also eliminate the need for special handling of row estimates on join paths. It also significantly reduces planning effort. While this approach may miss potentially more efficient plans where applying partial aggregation at a higher join level would yield better performance, it strikes a practical balance: we can still find plans that outperform those without eager aggregation, without incurring excessive planning overhead. As discussed earlier, it's uncommon in practice to encounter multiple joins that dramatically inflate row counts. So in most cases, pushing partial aggregation to the lowest level where it offers a significant row count reduction tends to be the most efficient strategy. I think this heuristic serves as a good starting point, and we can look into extending it with more advanced strategies as the feature evolves. Any thoughts? Thanks Richard