[
https://issues.apache.org/jira/browse/DRILL-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660915#comment-14660915
]
Victoria Markman commented on DRILL-3429:
-----------------------------------------
It looks like wrong result is only returned for +windowed+ statistical
aggregates.
Regular aggregates seem to work correctly (I will test them all):
{code}
0: jdbc:drill:schema=dfs> select
. . . . . . . . . . . . > ( SUM(POW(c_bigint, 2)) -
POW(SUM(c_bigint),2) / COUNT(c_bigint)) / (COUNT(c_bigint) - 1), -- just to
check if result is the same as var_samp
. . . . . . . . . . . . > VAR_SAMP(c_bigint)
. . . . . . . . . . . . > from
. . . . . . . . . . . . > j1;
+------------------------+------------------------+
| EXPR$0 | EXPR$1 |
+------------------------+------------------------+
| 5.5657975693740512E16 | 5.5657975693740512E16 |
+------------------------+------------------------+
1 row selected (0.901 seconds)
{code}
Windowed statistical aggregate:
{code}
0: jdbc:drill:schema=dfs> select VAR_SAMP(c_bigint) over() from j1 limit 1;
+-----------------------+
| EXPR$0 |
+-----------------------+
| -6.45296349643897E14 |
+-----------------------+
1 row selected (0.384 seconds)
{code}
Query plan for statistical aggregate:
{code}
0: jdbc:drill:schema=dfs> explain plan for select VAR_SAMP(c_integer) over()
from j1 limit 1;
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 Project(EXPR$0=[$0])
00-02 SelectionVectorRemover
00-03 Limit(fetch=[1])
00-04 Project(EXPR$0=[/(CastHigh(-($2, /(*($3, $3), $4))), -($4,
1))])
00-05 Window(window#0=[window(partition {} order by [] range
between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING aggs [SUM($1), SUM($0),
COUNT($0)])])
00-06 Project(c_integer=[$0], $1=[*($0, $0)])
00-07 Scan(groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath [path=maprfs:///drill/testdata/subqueries/j1]],
selectionRoot=maprfs:/drill/testdata/subqueries/j1, numFiles=1,
columns=[`c_integer`]]])
{code}
Looks like we already CastHigh all the arguments in the rewritten expression:
{code}
0: jdbc:drill:schema=dfs> explain plan for select VAR_SAMP(c_integer) from j1
limit 1;
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 Project(EXPR$0=[$0])
00-02 SelectionVectorRemover
00-03 Limit(fetch=[1])
00-04 Project(EXPR$0=[CAST(/(-($0, /(*($1, $1), $2)), CASE(=($2, 1),
null, -($2, 1)))):ANY NOT NULL])
00-05 StreamAgg(group=[{}], agg#0=[SUM($1)], agg#1=[SUM($0)],
agg#2=[COUNT($0)])
00-06 Project(c_integer=[CastHigh($0)], $f1=[*(CastHigh($0),
CastHigh($0))])
00-07 Scan(groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath [path=maprfs:///drill/testdata/subqueries/j1]],
selectionRoot=maprfs:/drill/testdata/subqueries/j1, numFiles=1,
columns=[`c_integer`]]])
{code}
> DrillAvgVarianceConvertlet may produce wrong results while rewriting stddev,
> variance
> -------------------------------------------------------------------------------------
>
> Key: DRILL-3429
> URL: https://issues.apache.org/jira/browse/DRILL-3429
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Mehant Baid
> Assignee: Mehant Baid
> Fix For: 1.2.0
>
>
> DrillAvgVarianceConvertlet currently rewrites aggregate functions like avg,
> stddev, variance to simple computations.
> Eg:
> Stddev( x ) => power(
> (sum(x * x) - sum( x ) * sum( x ) / count( x ))
> / count( x ),
> .5)
> Consider the case when the input is an integer. Now the rewrite contains
> multiplication and division, which will bind to functions that operate on
> integers however the expected result should be a double and since double has
> more precision than integer we should be operating on double during the
> multiplication and division.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)