Gabriel Lee created SPARK-36289:
-----------------------------------

             Summary: rewrite distinct count case when expressions without 
Expand node
                 Key: SPARK-36289
                 URL: https://issues.apache.org/jira/browse/SPARK-36289
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Gabriel Lee


Currently, RewriteDistinctAggregates rule will rewrite distinct aggregates with 
extra Expand node. This causes unnecessary memory waste and performance 
fallback for some specific aggregates(e.g. count distinct case when).

We can rewrite count distinct case when aggregates without Expand:

for query:

* {{{
 * SELECT
 * cat1,
 * COUNT(DISTINCT CASE WHEN cond1 THEN cat2 ELSE null end) as cat2_cnt1,
 * COUNT(DISTINCT CASE WHEN cond2 THEN cat2 ELSE null end) as cat2_cnt2,
 * FROM
 * data
 * GROUP BY
 * key
 * }}}

 

we should rewrite to :

* {{{
 * Aggregate(
 * key = ['key]
 * functions = [count(if (01 & 'bit_vector != 0) 0 else null),
 * count(if (10 & 'bit_vector != 0) 0 else null)]
 * output = ['key, 'cat2_cnt1, 'cat2_cnt2])
 * Aggregate(
 * key = ['key, 'cat1]
 * functions = [bit_or(if (cond1) 01 else 00, if (cond2) 10 else 00)]
 * output = ['key, 'cat1, 'bit_vector])
 * LocalTableScan [...]
 * }}}

 

This method will improve performance and reduce memory waste



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to