[ 
https://issues.apache.org/jira/browse/CALCITE-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390494#comment-16390494
 ] 

Julian Hyde commented on CALCITE-2202:
--------------------------------------

Your analysis seems mostly correct.

One thing you forgot is "GROUP BY ()". "()" is a 0-tuple that has 1 value. 
"SELECT ... FROM t GROUP BY ()" has one row, even if t is empty. This can have 
some surprising effects if you are assuming, for instance, that an aggregate 
never has more rows than its input relation.

Outer joins are surprisingly common, so it's worth dealing with them properly. 
I am not convinced that you can apply your partitioning argument to 
outer-theta-join, for example.

You are correct that "COUNT(*)" counts all rows but I like to think of it as a 
special case of n-ary count where n = 0. (Usual COUNT is 1-ary, e.g. 
"COUNT(a)", but you can have 2-ary, "COUNT(a, b)", etc.) "COUNT", like most 
aggregate functions, ignores a row if any of its arguments are null, and 0-ary 
count clearly never receives a null argument.

When pushing expressions (filters and projects) through outer joins, we needed 
to know whether an expression "f(a)" being not-null after the join meant that 
"a" was not-null before the join. If that reasoning is useful for aggregates, 
see the {{Strong}} class.

Regarding your statement
{quote}it's apparent that aggregation can be pushed on on a single side (either 
side), and leave the other side non-aggregated, regardless of whether grouping 
columns are unique on the other side
{quote}
I don't agree. Duplicates do matter. Consider the following schema where there 
are duplicate rows with deptno = 10 in both emp and dept tables:
{code:java}
create table dept (deptno int, name char(20));
insert into dept values (10, 'sales');
insert into dept values (10, 'marketing');
insert into dept values (20, 'engineering');
create table emp (deptno int, sal int);
insert into emp values (10, 200);
insert into emp values (10, 300);
insert into emp values (10, 500);
select emp.deptno, sum(sal) as s
from emp join dept on emp.deptno = dept.deptno
group by emp.deptno;;
{code}
returns s = 2000 (200 * 2 + 300 * 2 + 500 * 2). I call the 2 the 
"cross-multiplier", because there are 2 values with deptno = 10 in dept.

The pushed-down query is
{code:java}
select sum(e.s * d.c)
from (select deptno, sum(sal) as s from emp) as e
join (select deptno, count(*) as c from dept) as d
on e.deptno = d.deptno
group by e.deptno{code}
and you can see the cross-multiplication in {{sum(e.s * d.c)}}.

If you know that {{dept.deptno}} is unique then you know that d.c is always 1, 
and so {{sum(e.s * d.c)}} becomes just {{sum(e.s)}}. That's how I'd approach it 
– first get the formula right without assuming uniqueness constraints, then 
simplify given the uniqueness constraints you have.

> Aggregate Join Push-down on a Single Side
> -----------------------------------------
>
>                 Key: CALCITE-2202
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2202
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: next
>            Reporter: Zhong Yu
>            Assignee: Julian Hyde
>            Priority: Major
>             Fix For: next
>
>
> While investigating https://issues.apache.org/jira/browse/CALCITE-2195, it's 
> apparent that aggregation can be pushed on on a single side (either side), 
> and leave the other side non-aggregated, regardless of whether grouping 
> columns are unique on the other side. My analysis – 
> [http://zhong-j-yu.github.io/aggregate-join-push-down.pdf] .
> This may be useful when the metadata is insufficient; in any case, we may try 
> to provide all 3 possible transformations (aggregate on left only; right 
> only; both sides) to the cost based optimizer, so that the cheapest one can 
> be chosen based on stats. 
> Does this make any sense, anybody? If it sounds good, I'll implement it and 
> offer a PR. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to