[ 
https://issues.apache.org/jira/browse/CALCITE-6893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17936089#comment-17936089
 ] 

Julian Hyde commented on CALCITE-6893:
--------------------------------------

Thank you. I understand it better now.

If you had a similar query with {{job}} rather than {{ename}}, would be 
transformation be valid? I don't believe so. SQL0 is
{code:java}
select job from emp where deptno = 10
intersect
select job from emp where deptno = 20;
+---------+
|   JOB   |
+---------+
| CLERK   |
| MANAGER |
+---------+
2 rows selected (0.005 seconds)
{code}
and SQL1 is
{code:java}
select job
from (
  select job, count(*) as c
  from (
    select job from emp
    where deptno = 10
    union all
    select job from emp
    where deptno = 20)
  group by job)
where c = 2;
+---------+
|   JOB   |
+---------+
| MANAGER |
| ANALYST |
+---------+
2 rows selected (0.003 seconds)
{code}
and as you can see, they give different results on the SCOTT database. The 'c = 
2' check is designed to check that there is at least one employee with that job 
in both branches (i.e. dept 10 and 20), but there are two analysts in dept 20, 
and without the 'group by' that will satisfy the 'c = 2' condition.

Your transformation works for the 'ename' query because in the small SCOTT data 
set, 'ename' is unique. Your transformation is valid if the column is genuinely 
unique (e.g. 'empno') but in that case, there's already a rule (I don't recall 
its name) to remove a GROUP BY.

> Remove agg from Union children in IntersectToDistinctRule
> ---------------------------------------------------------
>
>                 Key: CALCITE-6893
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6893
>             Project: Calcite
>          Issue Type: Improvement
>            Reporter: Zhen Chen
>            Assignee: Zhen Chen
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2025-03-16-11-02-04-725.png
>
>
> Whether agg should be pushed down should be determined by the cost.
> SQL:
>  
> {code:java}
> select ename from emp where deptno = 10
> intersect
> select ename from emp where deptno = 20{code}
>  
> Then used rule INTERSECT_TO_DISTINCT(updated version)  and 
> AGGREGATE_UNION_TRANSPOSE in hep planner.
> We can get logical plan:
>  
> {code:java}
> LogicalProject(ENAME=[$0])
>   LogicalFilter(condition=[=($1, 2)])
>     LogicalAggregate(group=[{0}], agg#0=[$SUM0($1)])
>       LogicalUnion(all=[true])
>         LogicalAggregate(group=[{0}], agg#0=[COUNT()])
>           LogicalProject(ENAME=[$1])
>             LogicalFilter(condition=[=($7, 10)])
>               LogicalTableScan(table=[[CATALOG, SALES, EMP]])
>         LogicalAggregate(group=[{0}], agg#0=[COUNT()])
>           LogicalProject(ENAME=[$1])
>             LogicalFilter(condition=[=($7, 20)])
>               LogicalTableScan(table=[[CATALOG, SALES, EMP]]) {code}
> Then we also use the two same rules in volcanol planner.
>  
> Final Phy Plan:
>  
> {code:java}
> EnumerableProject(ENAME=[$0]): rowcount = 1.0, cumulative cost = 
> {43.72500000000001 rows, 68.4 cpu, 0.0 io}, id = 85
>   EnumerableFilter(condition=[=($1, 2)]): rowcount = 1.0, cumulative cost = 
> {42.72500000000001 rows, 67.4 cpu, 0.0 io}, id = 84
>     EnumerableAggregate(group=[{0}], agg#0=[COUNT()]): rowcount = 1.0, 
> cumulative cost = {41.72500000000001 rows, 66.4 cpu, 0.0 io}, id = 83
>       EnumerableUnion(all=[true]): rowcount = 4.2, cumulative cost = 
> {40.60000000000001 rows, 66.4 cpu, 0.0 io}, id = 82
>         EnumerableProject(ENAME=[$1]): rowcount = 2.1, cumulative cost = 
> {18.200000000000003 rows, 31.1 cpu, 0.0 io}, id = 79
>           EnumerableFilter(condition=[=($7, 10)]): rowcount = 2.1, cumulative 
> cost = {16.1 rows, 29.0 cpu, 0.0 io}, id = 78
>             EnumerableTableScan(table=[[CATALOG, SALES, EMP]]): rowcount = 
> 14.0, cumulative cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 69
>         EnumerableProject(ENAME=[$1]): rowcount = 2.1, cumulative cost = 
> {18.200000000000003 rows, 31.1 cpu, 0.0 io}, id = 81
>           EnumerableFilter(condition=[=($7, 20)]): rowcount = 2.1, cumulative 
> cost = {16.1 rows, 29.0 cpu, 0.0 io}, id = 80
>             EnumerableTableScan(table=[[CATALOG, SALES, EMP]]): rowcount = 
> 14.0, cumulative cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 69
>  {code}
>  
> We can see the best plan, the children of union do not have agg. 
> DAG:
> !image-2025-03-16-11-02-04-725.png|width=545,height=348!
> Currently, Calcite does not support distributed planning. If in a distributed 
> planning, agg will be divided into two stages. If the filtering effect in the 
> first stage is very good, the downward push of agg will be meaningful and 
> reduce the network transmission of shuffle. However, optimizing the current 
> rule is also meaningful. Calcite now also has rules that can do the downward 
> push of agg. We can give the choice to the volcano.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to