ydgandhi opened a new pull request, #21088: URL: https://github.com/apache/datafusion/pull/21088
Add MultiDistinctCountRewrite in datafusion-optimizer and register it in Optimizer::new() after SingleDistinctToGroupBy. Rewrites 2+ simple COUNT(DISTINCT) on different args into a join of two-phase aggregates; filter distinct_arg IS NOT NULL on each branch for correct NULL semantics.

- ✅ Unit tests in datafusion-optimizer
- ✅ SQL integration test (NULLs) in core_integration

## Which issue does this PR close?

- Closes #21087.

## Rationale for this change

Queries with multiple COUNT(DISTINCT col_i) in the same GROUP BY can force independent distinct state per aggregate (e.g. separate hash sets), which scales poorly in memory when several high-cardinality distinct columns appear together.

DataFusion already optimizes the single shared distinct field case via SingleDistinctToGroupBy. This PR adds a conservative logical rewrite for multiple distinct COUNT(DISTINCT …) arguments by splitting work into per-distinct branches joined on the group keys, which reduces peak memory for eligible plans.

COUNT(DISTINCT x) must ignore NULL x; the rewrite applies x IS NOT NULL on each distinct branch before inner grouping so semantics stay aligned with count_distinct behavior.

## What changes are included in this PR?

New module:

- datafusion/optimizer/src/multi_distinct_count_rewrite.rs: MultiDistinctCountRewrite (OptimizerRule, bottom-up).
- Register rule in datafusion/optimizer/src/optimizer.rs immediately after SingleDistinctToGroupBy.
- Export module from datafusion/optimizer/src/lib.rs.

Tests:

- datafusion-optimizer: rewrite vs no-op cases (single distinct, mixed aggs, two distincts).
- datafusion core_integration: datafusion/core/tests/sql/aggregates/multi_distinct_count_rewrite.rs, SQL over MemTable with NULLs.

## Are these changes tested?

Yes.

- cargo test -p datafusion-optimizer (includes new unit tests).
- cargo test -p datafusion --test core_integration multi_count_distinct (SQL + NULLs).

## Are there any user-facing changes?

- Behavior / plans: For queries that match the rule, logical plans (and thus EXPLAIN output) can differ: eligible multi-COUNT(DISTINCT) aggregates may appear as joins of sub-aggregates instead of a single Aggregate with multiple distinct counts.
- Results: Intended to be semantics-preserving for supported patterns (including NULL handling via filters).
- Public API: No intentional breaking changes to public Rust APIs; this is an internal optimizer rule enabled by default.
- Docs: No user guide update required unless maintainers want an "optimizer behavior" note; can add if requested.
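
For reviewers who want to see the shape of the rewrite outside the optimizer, here is a minimal standalone sketch in plain Rust (std only, no DataFusion types). It is an illustration, not the code in this PR: the `Row` type and the functions `direct`, `branch`, `rewritten`, `col_a` and `col_b` are hypothetical names. `direct` models a single aggregate holding one distinct set per COUNT(DISTINCT); `branch` models one per-column branch (filter x IS NOT NULL, deduplicate on group key and x, then count per group key); `rewritten` joins the branches on the group key. The sketch assumes that group keys come from the unfiltered input and that a group missing from a branch counts as 0, so a group whose values are all NULL in one column is still returned.

```rust
use std::collections::{BTreeMap, BTreeSet, HashSet};

/// One input row: a group key and two nullable columns (hypothetical shape).
type Row = (&'static str, Option<i64>, Option<i64>);

fn col_a(r: &Row) -> Option<i64> { r.1 }
fn col_b(r: &Row) -> Option<i64> { r.2 }

/// Baseline: one pass that keeps a separate distinct set per aggregate per group.
fn direct(rows: &[Row]) -> BTreeMap<&'static str, (usize, usize)> {
    let mut state: BTreeMap<&'static str, (HashSet<i64>, HashSet<i64>)> = BTreeMap::new();
    for r in rows {
        let entry = state.entry(r.0).or_default();
        // COUNT(DISTINCT x) ignores NULL x.
        if let Some(a) = r.1 { entry.0.insert(a); }
        if let Some(b) = r.2 { entry.1.insert(b); }
    }
    state.into_iter().map(|(g, (a, b))| (g, (a.len(), b.len()))).collect()
}

/// One branch of the rewritten shape for a single distinct argument.
fn branch(rows: &[Row], pick: fn(&Row) -> Option<i64>) -> BTreeMap<&'static str, usize> {
    // Phase 1: drop NULLs, then deduplicate on (group key, distinct argument).
    let deduped: BTreeSet<(&'static str, i64)> = rows
        .iter()
        .filter_map(|r| pick(r).map(|x| (r.0, x)))
        .collect();
    // Phase 2: plain row count per group key over the deduplicated pairs.
    let mut counts: BTreeMap<&'static str, usize> = BTreeMap::new();
    for (g, _) in deduped {
        *counts.entry(g).or_insert(0) += 1;
    }
    counts
}

/// Join the per-column branches on the group key.
/// Sketch assumption: keys come from the unfiltered input and a key that is
/// absent from a branch (all values NULL) contributes a count of 0.
fn rewritten(rows: &[Row]) -> BTreeMap<&'static str, (usize, usize)> {
    let a = branch(rows, col_a);
    let b = branch(rows, col_b);
    let keys: BTreeSet<&'static str> = rows.iter().map(|r| r.0).collect();
    keys.into_iter()
        .map(|g| (g, (a.get(g).copied().unwrap_or(0), b.get(g).copied().unwrap_or(0))))
        .collect()
}

fn main() {
    let rows: Vec<Row> = vec![
        ("x", Some(1), Some(10)),
        ("x", Some(1), None),
        ("x", Some(2), Some(10)),
        ("y", None, Some(7)),
        ("y", None, Some(8)),
    ];
    // Both strategies must agree, including for group "y" where column a is all NULL.
    assert_eq!(direct(&rows), rewritten(&rows));
    for (g, (ca, cb)) in rewritten(&rows) {
        println!("{g}: count_distinct_a={ca} count_distinct_b={cb}");
    }
}
```

Each branch only ever holds the deduplicated pairs for one column, which is the property the memory argument in the rationale relies on. How the actual rule treats a group that is entirely NULL in one column is not spelled out in this description, so the zero-fill in `rewritten` is a choice made for the sketch only.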
