buraksenn opened a new pull request, #20926:
URL: https://github.com/apache/datafusion/pull/20926

   ## Which issue does this PR close?
   Part of https://github.com/apache/datafusion/issues/20766
   
   ## Rationale for this change
   Grouped aggregations currently estimate output rows as input_rows, ignoring 
available NDV (number of distinct values) statistics. Spark's 
AggregateEstimation and Trino's AggregationStatsRule both use NDV products to 
tighten this estimate. This PR draws heavily on both implementations.
   
   
     - [Spark 
reference](https://github.com/apache/spark/blob/e8d8e6a8d040d26aae9571e968e0c64bda0875dc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AggregateEstimation.scala#L38-L61)
     - [Trino 
reference](https://github.com/trinodb/trino/blob/43c8c3ba8bff814697c5926149ce13b9532f030b/core/trino-main/src/main/java/io/trino/cost/AggregationStatsRule.java#L92-L101)
   
   ## What changes are included in this PR?
   - Estimate aggregate output rows as min(input_rows, product(NDV_i + 
null_adj_i) * grouping_sets)
   - Cap by the Top K limit when active, since the output row count cannot exceed K
   - Propagate distinct_count from child stats to group-by output columns
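
   The row estimate described in the list above can be sketched roughly as 
follows. This is only an illustration, not the code in this PR: the 
`GroupColumnStats` struct, the `estimate_group_by_rows` function, the choice 
of adding 1 for a nullable column, and the fallback to input_rows when an NDV 
is unknown are all assumptions made for the example.

```rust
/// Hypothetical per-column statistics for a group-by key (not a DataFusion type).
struct GroupColumnStats {
    /// Number of distinct non-null values, if known.
    distinct_count: Option<usize>,
    /// Number of null values, if known.
    null_count: Option<usize>,
}

/// Sketch of: min(input_rows, product(NDV_i + null_adj_i) * grouping_sets),
/// optionally capped by a Top K limit.
fn estimate_group_by_rows(
    input_rows: usize,
    group_cols: &[GroupColumnStats],
    grouping_sets: usize,
    topk_limit: Option<usize>,
) -> usize {
    // Start from 1 so the product over zero columns is a single group.
    let mut product: Option<usize> = Some(1);
    for col in group_cols {
        product = match (product, col.distinct_count) {
            (Some(acc), Some(ndv)) => {
                // Assumption: NULL forms one extra group when nulls are present.
                let null_adj = match col.null_count {
                    Some(n) if n > 0 => 1,
                    _ => 0,
                };
                // Saturate instead of overflowing on large NDV products.
                Some(acc.saturating_mul(ndv.saturating_add(null_adj)))
            }
            // Assumption: any unknown NDV makes the whole product unknown.
            _ => None,
        };
    }

    // Fall back to input_rows when the NDV product is unknown.
    let mut estimate = match product {
        Some(p) => p.saturating_mul(grouping_sets).min(input_rows),
        None => input_rows,
    };

    // The output can never contain more than K rows when a Top K limit applies.
    if let Some(k) = topk_limit {
        estimate = estimate.min(k);
    }
    estimate
}
```

   For example, with 1000 input rows and two group-by columns with NDVs of 10 
(no nulls) and 5 (some nulls), this sketch yields 10 * (5 + 1) = 60 rows 
instead of 1000, and 20 rows if a Top K limit of 20 is active.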
   
   ## Are these changes tested?
   Yes, by existing and new tests that cover different scenarios and edge cases.
   
   
   ## Are there any user-facing changes?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]