[jira] [Commented] (CALCITE-6893) Remove agg from Union children in IntersectToDistinctRule

Julian Hyde (Jira) Mon, 17 Mar 2025 11:55:09 -0700


    [ 
https://issues.apache.org/jira/browse/CALCITE-6893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17936278#comment-17936278
 ]


Julian Hyde commented on CALCITE-6893:
--------------------------------------

That query is valid. I would support having a variant of the rule that 
generates count-distinct.

However, many implementations of count-distinct will introduce an intermediate 
"Aggregate(job, i)" step, and these will be equivalent to the current plan. 
(Some implementations, such as BigQuery, will use a small array; for these 
implementations, count-distinct will be a win.)

If the number of branches (N) is small, we can use other aggregate functions. 
If N = 2 we can use min and max:
{code}
select job
from (
  select job, 0 as i from "scott".emp where deptno = 10
  union all
  select job, 1 as i from "scott".emp where deptno = 20)
group by job
having min(i) = 0 and max(i) = 1;
{code}

For small N we can use separate {{count}} aggregate functions:
{code}
select job
from (
  select job, 0 as i from "scott".emp where deptno = 10
  union all
  select job, 1 as i from "scott".emp where deptno = 20)
group by job
having count(*) filter (where i = 0) > 0
   and count(*) filter (where i = 1) > 0;
{code}
or equivalently
{code}
select job
from (
  select job, 1 as i0, null as i1 from "scott".emp where deptno = 10
  union all
  select job, null as i0, 1 as i1 from "scott".emp where deptno = 20)
group by job
having count(i0) > 0
   and count(i1) > 0;
{code}


> Remove agg from Union children in IntersectToDistinctRule
> ---------------------------------------------------------
>
>                 Key: CALCITE-6893
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6893
>             Project: Calcite
>          Issue Type: Improvement
>            Reporter: Zhen Chen
>            Assignee: Zhen Chen
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2025-03-16-11-02-04-725.png
>
>
> Whether agg should be pushed down should be determined by the cost.
> SQL:
>  
> {code:java}
> select ename from emp where deptno = 10
> intersect
> select ename from emp where deptno = 20{code}
>  
> Then used rule INTERSECT_TO_DISTINCT(updated version)  and 
> AGGREGATE_UNION_TRANSPOSE in hep planner.
> We can get logical plan:
>  
> {code:java}
> LogicalProject(ENAME=[$0])
>   LogicalFilter(condition=[=($1, 2)])
>     LogicalAggregate(group=[{0}], agg#0=[$SUM0($1)])
>       LogicalUnion(all=[true])
>         LogicalAggregate(group=[{0}], agg#0=[COUNT()])
>           LogicalProject(ENAME=[$1])
>             LogicalFilter(condition=[=($7, 10)])
>               LogicalTableScan(table=[[CATALOG, SALES, EMP]])
>         LogicalAggregate(group=[{0}], agg#0=[COUNT()])
>           LogicalProject(ENAME=[$1])
>             LogicalFilter(condition=[=($7, 20)])
>               LogicalTableScan(table=[[CATALOG, SALES, EMP]]) {code}
> Then we also use the two same rules in volcanol planner.
>  
> Final Phy Plan:
>  
> {code:java}
> EnumerableProject(ENAME=[$0]): rowcount = 1.0, cumulative cost = 
> {43.72500000000001 rows, 68.4 cpu, 0.0 io}, id = 85
>   EnumerableFilter(condition=[=($1, 2)]): rowcount = 1.0, cumulative cost = 
> {42.72500000000001 rows, 67.4 cpu, 0.0 io}, id = 84
>     EnumerableAggregate(group=[{0}], agg#0=[COUNT()]): rowcount = 1.0, 
> cumulative cost = {41.72500000000001 rows, 66.4 cpu, 0.0 io}, id = 83
>       EnumerableUnion(all=[true]): rowcount = 4.2, cumulative cost = 
> {40.60000000000001 rows, 66.4 cpu, 0.0 io}, id = 82
>         EnumerableProject(ENAME=[$1]): rowcount = 2.1, cumulative cost = 
> {18.200000000000003 rows, 31.1 cpu, 0.0 io}, id = 79
>           EnumerableFilter(condition=[=($7, 10)]): rowcount = 2.1, cumulative 
> cost = {16.1 rows, 29.0 cpu, 0.0 io}, id = 78
>             EnumerableTableScan(table=[[CATALOG, SALES, EMP]]): rowcount = 
> 14.0, cumulative cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 69
>         EnumerableProject(ENAME=[$1]): rowcount = 2.1, cumulative cost = 
> {18.200000000000003 rows, 31.1 cpu, 0.0 io}, id = 81
>           EnumerableFilter(condition=[=($7, 20)]): rowcount = 2.1, cumulative 
> cost = {16.1 rows, 29.0 cpu, 0.0 io}, id = 80
>             EnumerableTableScan(table=[[CATALOG, SALES, EMP]]): rowcount = 
> 14.0, cumulative cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 69
>  {code}
>  
> We can see the best plan, the children of union do not have agg. 
> DAG:
> !image-2025-03-16-11-02-04-725.png|width=545,height=348!
> Currently, Calcite does not support distributed planning. If in a distributed 
> planning, agg will be divided into two stages. If the filtering effect in the 
> first stage is very good, the downward push of agg will be meaningful and 
> reduce the network transmission of shuffle. However, optimizing the current 
> rule is also meaningful. Calcite now also has rules that can do the downward 
> push of agg. We can give the choice to the volcano.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (CALCITE-6893) Remove agg from Union children in IntersectToDistinctRule

Reply via email to