xumingming opened a new pull request, #55925:
URL: https://github.com/apache/spark/pull/55925
### What changes were proposed in this pull request?
Adds a RewriteCountDistinctConditional optimizer rule that canonicalizes:
COUNT(DISTINCT IF(cond, base, NULL))
COUNT(DISTINCT CASE WHEN cond THEN base END)
into:
COUNT(DISTINCT base) FILTER (WHERE cond)
When every conditional expression wraps the same base column, this reduces the
number of distinct groups seen by RewriteDistinctAggregates from N (one per
unique conditional expression) to 1, collapsing the Expand factor from N× to 1×.
Gated by spark.sql.optimizer.rewriteCountDistinctConditional.enabled
(default: false).
Includes comprehensive unit tests for rewrite patterns and safety boundaries.
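
To illustrate why the rewrite preserves results, here is a minimal sketch in plain Python (not Spark code). It models both SQL forms over a small, made-up set of rows, with None standing in for SQL NULL. The column names `user` and `status`, the sample data, and the helper functions are hypothetical and exist only for this illustration.

```python
# Plain-Python model of the two SQL forms; None stands in for SQL NULL.
# The rows and column names are made up for illustration only.
rows = [
    {"user": "a", "status": "ok"},
    {"user": "a", "status": "fail"},
    {"user": "b", "status": "ok"},
    {"user": None, "status": "ok"},
    {"user": "c", "status": "fail"},
]


def cond(row):
    """Hypothetical predicate playing the role of `cond`."""
    return row["status"] == "ok"


def count_distinct(values):
    """COUNT(DISTINCT x): NULLs (None) are ignored, duplicates collapse."""
    return len({v for v in values if v is not None})


# COUNT(DISTINCT IF(cond, user, NULL)):
# rows failing the predicate contribute NULL, which COUNT ignores.
conditional_form = count_distinct(
    r["user"] if cond(r) else None for r in rows
)

# COUNT(DISTINCT user) FILTER (WHERE cond):
# rows failing the predicate never reach the aggregate.
filter_form = count_distinct(r["user"] for r in rows if cond(r))

print(conditional_form, filter_form)
```

Both forms count the same set of non-NULL base values from rows where the predicate holds, so they agree on this data. The sketch does not cover the safety boundaries the rule has to check.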
### Why are the changes needed?
When a query contains many COUNT(DISTINCT IF(cond_i, col, NULL)) expressions
over the same base column, RewriteDistinctAggregates treats each unique IF(...)
expression as a distinct group. N conditions → N distinct groups → N× Expand
amplification. In production workloads with 25–60 such expressions, this
produces multi-terabyte shuffles and hour-long runtimes.
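
The amplification described above can be sketched with a deliberately simplified model in plain Python. It is not the actual RewriteDistinctAggregates implementation: it only assumes, as the description states, that distinct aggregates are grouped by the expression they aggregate over, and that the Expand factor equals the number of such groups. The dictionary layout and the `distinct_groups` helper are hypothetical.

```python
# Simplified model: each aggregate records the expression under DISTINCT
# and an optional FILTER predicate. Not Spark's real data structures.
def distinct_groups(aggregates):
    """Group distinct aggregates by the expression they aggregate over."""
    return {agg["distinct_child"] for agg in aggregates}


n = 25  # lower end of the 25-60 range mentioned in the description

# Before the rewrite: every IF(cond_i, col, NULL) is a different expression.
before = [
    {"distinct_child": f"IF(cond_{i}, col, NULL)", "filter": None}
    for i in range(n)
]

# After the rewrite: all aggregates share `col`; the condition moves to FILTER.
after = [
    {"distinct_child": "col", "filter": f"cond_{i}"}
    for i in range(n)
]

expand_before = len(distinct_groups(before))
expand_after = len(distinct_groups(after))
print(expand_before, expand_after)
```

Under this model the number of distinct groups drops from N to 1 once the conditions are carried as filters rather than embedded in the aggregated expression.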
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests covering the rewrite patterns and safety boundaries.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]