Maybe you could tweak the cost of Aggregate so that the cost is very high if AVG is present. (We don’t hit this problem in core Calcite, because we don’t have an implementation of AVG so expanding to SUM / COUNT is the only viable option.)
I would love to unify how we process aggregate functions and windowed aggregate functions. Just this week someone pointed out[1] that we optimize “COUNT(x)” to “COUNT(*)” if x is not null, but we do not optimize “COUNT(x) OVER w” to “COUNT(*) OVER w”. Julian [1] https://issues.apache.org/jira/browse/CALCITE-2126 <https://issues.apache.org/jira/browse/CALCITE-2126> > On Jan 5, 2018, at 5:42 PM, Michael Alexeev <[email protected]> wrote: > > I actually tried to add the AggregateReduceFunctionsRule but the cumulative > cost of the SUM / COUNT expression was always higher than the cost of the > original AGR making it a winner. > > Not sure how to get around it. > > According to the documentation, The AvgVarianceConvertlet.expandAvg is > invoked in case of windowed aggregate functions and not the regular aggregates > like in my case. > > Mike > > On Wed, Jan 3, 2018 at 9:15 PM, Julian Hyde <[email protected]> wrote: > >> Calcite’s usual behavior is to split AVG(x) into SUM(x) / COUNT(x). >> (Actually something slightly more complicated, to deal with COUNT = 0.) >> >> Normally, this happens during sql-to-rel conversion: the expandAvg method >> in StandardConvertletTable[1]. If you add a test for say “select avg(sal) >> from emp” to SqlToRelConvererTest you should see that happening. >> >> But if you are going straight to RelNode, there is also a rule you can >> use: AggregateReduceFunctionsRule [2]. >> >> Julian >> >> [1] https://insight.io/github.com/apache/calcite/blob/master/ >> core/src/main/java/org/apache/calcite/sql2rel/ >> StandardConvertletTable.java?line=1194 <https://insight.io/github. >> com/apache/calcite/blob/master/core/src/main/java/org/ >> apache/calcite/sql2rel/StandardConvertletTable.java?line=1194> >> >> [2] https://insight.io/github.com/apache/calcite/blob/master/ >> core/src/main/java/org/apache/calcite/rel/rules/ >> AggregateReduceFunctionsRule.java?line=82 <https://insight.io/github. >> com/apache/calcite/blob/master/core/src/main/java/org/ >> apache/calcite/rel/rules/AggregateReduceFunctionsRule.java?line=82> >> >> >>> On Jan 3, 2018, at 4:20 PM, Michael Alexeev <[email protected]> >> wrote: >>> >>> Hi All, >>> >>> Please consider a simple aggregated query >>> >>> select avg(C) from T; >>> >>> which is translated into a following relational expression >>> >>> LogicalAggregate(group=[{}], EXPR$0=[AVG($0)]) >>> LogicalProject(C=[$2]) >>> VoltDBTableSeqScan(table=[[T]], expr#0..5=[{inputs}], >>> proj#0..5=[{exprs}]) >>> >>> And this works perfectly fine if a table physically located on a single >>> physical machine. >>> >>> But in case of a distributed table (partitioned table where each >> partition >>> resides on a different physical node, for example) this query needs to be >>> run on each individual node, each node's results are sent to a dedicated >>> coordinator node which does the final aggregation across the nodes. To >>> calculate a column's AVG in this case, each node has to calculate two >>> aggregates SUM(T.C) and COUNT(T.C) and the coordinator node has also >>> compute the SUM and COUNT across the nodes and divide the final SUM by >> the >>> final COUNT. >>> >>> I wonder if there is a way to force Calcite to translate the query above >>> into >>> LogicalAggregate(group=[{}], EXPR$0=[SUM($0)], EXPR$1=[COUNT($0)]) >>> LogicalProject(C=[$2]) >>> VoltDBTableSeqScan(table=[[T]], expr#0..5=[{inputs}], >>> proj#0..5=[{exprs}]) >>> >>> instead of a single AVG aggregate? >>> >>> Thanks, >>> Mike >> >>
