Gabriel Lee created SPARK-36289:
-----------------------------------
Summary: rewrite distinct count case when expressions without
Expand node
Key: SPARK-36289
URL: https://issues.apache.org/jira/browse/SPARK-36289
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.2.0
Reporter: Gabriel Lee
Currently, RewriteDistinctAggregates rule will rewrite distinct aggregates with
extra Expand node. This causes unnecessary memory waste and performance
fallback for some specific aggregates(e.g. count distinct case when).
We can rewrite count distinct case when aggregates without Expand:
for query:
* {{{
* SELECT
* cat1,
* COUNT(DISTINCT CASE WHEN cond1 THEN cat2 ELSE null end) as cat2_cnt1,
* COUNT(DISTINCT CASE WHEN cond2 THEN cat2 ELSE null end) as cat2_cnt2,
* FROM
* data
* GROUP BY
* key
* }}}
we should rewrite to :
* {{{
* Aggregate(
* key = ['key]
* functions = [count(if (01 & 'bit_vector != 0) 0 else null),
* count(if (10 & 'bit_vector != 0) 0 else null)]
* output = ['key, 'cat2_cnt1, 'cat2_cnt2])
* Aggregate(
* key = ['key, 'cat1]
* functions = [bit_or(if (cond1) 01 else 00, if (cond2) 10 else 00)]
* output = ['key, 'cat1, 'bit_vector])
* LocalTableScan [...]
* }}}
This method will improve performance and reduce memory waste
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]