[ 
https://issues.apache.org/jira/browse/CALCITE-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17995699#comment-17995699
 ] 

Silun Dong edited comment on CALCITE-7083 at 7/4/25 8:16 AM:
-------------------------------------------------------------

Let me add some details, if I understand correctly. 

The default implementation of {{getDistinctRowCount(RelNode rel, 
ImmutableBitSet groupKey, @Nullable RexNode predicate)}} for aggregate 
calculates the distinct row count of the aggregate input (based on the columns 
included in the {{{}groupKey{}}}). When the {{groupKey}} contains the 
aggregated columns, this calculation logic make no sense.

case.1
||name||sal||
|b|10|
|b|20|
|b|30|

after group by name:
||name||sum(sal)||
|b|60|

if {{groupKey}} is sum(sal), it will calculate the distinct value of sal and 
the result is 3, but the actual result should be 1.

case.2
||name||sal||
|a|10|
|b|10|
|c|10|

after group by name:
||name||sum(sal)||
|a|10|
|b|10|
|c|10|

if {{groupKey}} is sum(sal), it will calculate the distinct value of sal and 
the result is 1, but the actual result should be 3.

 

Essentially, we cannot infer the calculation results of aggregate functions. So 
when the {{groupKey}} contains the aggregated column, ??the handler should 
return null or (at most) the full cardinal of the aggregate??. I think it is a 
good idea to return the full row count. This is the most conservative estimate 
and can provide a reliable reference for the optimizer compared to returning 
null.


was (Author: JIRAUSER308615):
Let me add some details, if I understand correctly. 

The default implementation of {{getDistinctRowCount(RelNode rel, 
ImmutableBitSet groupKey, @Nullable RexNode predicate)}} for aggregate 
calculates the distinct row count of the aggregate input (based on the columns 
included in the {{{}groupKey{}}}). When the {{groupKey}} contains the 
aggregated columns, this calculation logic make no sense.

case.1

 
||name||sal||
|b|10|
|b|20|
|b|30|

after group by name:

 

 
||name||sum(sal)||
|b|60|

if {{groupKey}} is sum(sal), it will calculate the distinct value of sal and 
the result is 3, but the actual result should be 1.

 

 

case.2

 
||name||sal||
|a|10|
|b|10|
|c|10|

after group by name:

 
||name||sum(sal)||
|a|10|
|b|10|
|c|10|

if {{groupKey}} is sum(sal), it will calculate the distinct value of sal and 
the result is 1, but the actual result should be 3.

 

Essentially, we cannot infer the calculation results of aggregate functions. So 
when the {{groupKey}} contains the aggregated column, ??the handler should 
return null or (at most) the full cardinal of the aggregate??. I think it is a 
good idea to return the full row count. This is the most conservative estimate 
and can provide a reliable reference for the optimizer compared to returning 
null.

> RelMdDistinctRowCount aggregates implementation problems
> --------------------------------------------------------
>
>                 Key: CALCITE-7083
>                 URL: https://issues.apache.org/jira/browse/CALCITE-7083
>             Project: Calcite
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.40.0
>            Reporter: Claude Brisson
>            Priority: Major
>
> The default implementation of getDistinctRowCount for aggregates has several 
> problems:
> - when determining the pushable predicates, it makes the assumption that the 
> aggregate group key is a zero-based range, which is not necessarily the case 
> (the indices in the aggregate group key are the child indices, the predicates 
> are expressed in terms of the zero-based output range)
> - if there is any aggregated column in the queried group key, then it makes 
> no sense to query the distinct values on the aggregate input, the handler 
> should return null or (at most) the full cardinal of the aggregate



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to