ydgandhi opened a new pull request, #21088:
URL: https://github.com/apache/datafusion/pull/21088

   Add MultiDistinctCountRewrite to datafusion-optimizer and register it in 
Optimizer::new() after SingleDistinctToGroupBy. The rule rewrites two or more 
simple COUNT(DISTINCT) aggregates on different arguments into a join of 
two-phase aggregates, and filters distinct_arg IS NOT NULL on each branch to 
preserve NULL semantics.
   
   ✅ Unit tests in datafusion-optimizer; ✅ SQL integration test (NULLs) in 
core_integration.
   
   ## Which issue does this PR close?
   
   - Closes #21087.
   
   ## Rationale for this change
   
   Queries with multiple COUNT(DISTINCT col_i) in the same GROUP BY can force 
independent distinct state per aggregate (e.g. separate hash sets), which 
scales poorly in memory when several high-cardinality distinct columns appear 
together.
   
   DataFusion already optimizes the case where the distinct aggregates share a 
single field via SingleDistinctToGroupBy. This PR adds a conservative logical 
rewrite for COUNT(DISTINCT …) aggregates over different arguments: it splits 
the work into one branch per distinct argument and joins the branches on the 
group keys, which reduces peak memory for eligible plans.
   
   COUNT(DISTINCT x) must ignore NULL x; the rewrite applies x IS NOT NULL on 
each distinct branch before inner grouping so semantics stay aligned with 
count_distinct behavior.
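   
The branch-and-join shape described above can be illustrated with a small, standard-library-only Rust sketch. It does not use DataFusion or the rule's actual code; `Row`, `single_pass`, `branch`, and `rewritten` are hypothetical names chosen for illustration. The sketch compares a single aggregate pass that keeps one distinct set per aggregate against per-argument branches that filter `x IS NOT NULL`, deduplicate on (key, x), count per key, and are then joined on the group key. One assumption of the sketch: a group with no non-NULL value in a branch is given a count of 0 at join time; how the actual rule handles that case is not described here.

```rust
use std::collections::{BTreeMap, BTreeSet, HashSet};

/// One input row: (group key, a, b). `None` models SQL NULL.
type Row = (&'static str, Option<i64>, Option<i64>);

/// Baseline: one aggregate pass that keeps a separate distinct set per
/// COUNT(DISTINCT ...) expression and per group.
fn single_pass(rows: &[Row]) -> BTreeMap<&'static str, (usize, usize)> {
    let mut state: BTreeMap<&'static str, (HashSet<i64>, HashSet<i64>)> = BTreeMap::new();
    for &(g, a, b) in rows {
        let entry = state.entry(g).or_default();
        if let Some(a) = a {
            entry.0.insert(a);
        }
        if let Some(b) = b {
            entry.1.insert(b);
        }
    }
    state
        .into_iter()
        .map(|(g, (sa, sb))| (g, (sa.len(), sb.len())))
        .collect()
}

fn col_a(r: &Row) -> Option<i64> {
    r.1
}

fn col_b(r: &Row) -> Option<i64> {
    r.2
}

/// One branch for a single distinct argument: filter `x IS NOT NULL`,
/// deduplicate (key, x) pairs, then count the surviving pairs per key.
fn branch(rows: &[Row], pick: fn(&Row) -> Option<i64>) -> BTreeMap<&'static str, usize> {
    // Phase 1: drop NULLs and deduplicate on (key, x).
    let pairs: BTreeSet<(&'static str, i64)> = rows
        .iter()
        .filter_map(|r| pick(r).map(|x| (r.0, x)))
        .collect();
    // Phase 2: plain count per key over the deduplicated pairs.
    let mut counts = BTreeMap::new();
    for (g, _) in pairs {
        *counts.entry(g).or_insert(0usize) += 1;
    }
    counts
}

/// Rewritten shape: one branch per distinct argument, joined on the group key.
fn rewritten(rows: &[Row]) -> BTreeMap<&'static str, (usize, usize)> {
    let a = branch(rows, col_a);
    let b = branch(rows, col_b);
    // Assumption of this sketch: a group missing from a branch counts as 0
    // instead of being dropped by the join.
    let keys: BTreeSet<&'static str> = rows.iter().map(|r| r.0).collect();
    keys.into_iter()
        .map(|g| {
            let ca = a.get(g).copied().unwrap_or(0);
            let cb = b.get(g).copied().unwrap_or(0);
            (g, (ca, cb))
        })
        .collect()
}

fn main() {
    let rows: Vec<Row> = vec![
        ("g1", Some(1), Some(10)),
        ("g1", Some(1), None),
        ("g1", Some(2), Some(10)),
        ("g2", None, Some(7)),
        ("g2", None, Some(8)),
    ];
    // Both shapes should agree, including on NULL inputs.
    assert_eq!(single_pass(&rows), rewritten(&rows));
    println!("{:?}", rewritten(&rows));
}
```

In the baseline, every group holds one hash set per distinct aggregate at the same time, while each branch in the rewritten shape only tracks a single argument, which is the memory trade-off the rationale refers to.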
   
   ## What changes are included in this PR?
   
   Optimizer changes:
   - datafusion/optimizer/src/multi_distinct_count_rewrite.rs — 
MultiDistinctCountRewrite (OptimizerRule, bottom-up).
   - Register rule in datafusion/optimizer/src/optimizer.rs immediately after 
SingleDistinctToGroupBy.
   - Export module from datafusion/optimizer/src/lib.rs.
   
   Tests:
   - datafusion-optimizer: rewrite vs no-op cases (single distinct, mixed aggs, 
two distincts).
   - datafusion core_integration: 
datafusion/core/tests/sql/aggregates/multi_distinct_count_rewrite.rs — SQL over 
MemTable with NULLs.
   
   ## Are these changes tested?
   
   Yes.
   
   - cargo test -p datafusion-optimizer (includes new unit tests).
   - cargo test -p datafusion --test core_integration multi_count_distinct (SQL 
+ NULLs).
   
   ## Are there any user-facing changes?
   
   - Behavior / plans: For queries that match the rule, logical plans (and thus 
EXPLAIN output) can differ: eligible multi–COUNT(DISTINCT) aggregates may 
appear as joins of sub-aggregates instead of a single Aggregate with multiple 
distinct counts.
   
   - Results: Intended to be semantics-preserving for supported patterns 
(including NULL handling via filters).
   
   - Public API: No intentional breaking changes to public Rust APIs; this is 
an internal optimizer rule enabled by default.
   
   - Docs: No user guide update required unless maintainers want an “optimizer 
behavior” note; can add if requested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

