[jira] [Commented] (DRILL-3429) DrillAvgVarianceConvertlet may produce wrong results while rewriting stddev, variance

Victoria Markman (JIRA) Thu, 06 Aug 2015 15:47:17 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660915#comment-14660915
 ]


Victoria Markman commented on DRILL-3429:
-----------------------------------------

It looks like wrong result is only returned for +windowed+ statistical 
aggregates. 
Regular aggregates seem to work correctly (I will test them all):

{code}
0: jdbc:drill:schema=dfs> select 
. . . . . . . . . . . . >         ( SUM(POW(c_bigint, 2)) - 
POW(SUM(c_bigint),2) / COUNT(c_bigint)) / (COUNT(c_bigint) - 1),  -- just to 
check if result is the same as var_samp
. . . . . . . . . . . . >         VAR_SAMP(c_bigint)
. . . . . . . . . . . . > from
. . . . . . . . . . . . >         j1;
+------------------------+------------------------+
|         EXPR$0         |         EXPR$1         |
+------------------------+------------------------+
| 5.5657975693740512E16  | 5.5657975693740512E16  |
+------------------------+------------------------+
1 row selected (0.901 seconds)
{code}

Windowed statistical aggregate:
{code}
0: jdbc:drill:schema=dfs> select VAR_SAMP(c_bigint) over() from j1 limit 1;
+-----------------------+
|        EXPR$0         |
+-----------------------+
| -6.45296349643897E14  |
+-----------------------+
1 row selected (0.384 seconds)
{code}

Query plan for statistical aggregate:
{code}
0: jdbc:drill:schema=dfs> explain plan for select VAR_SAMP(c_integer) over() 
from j1 limit 1;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(EXPR$0=[$0])
00-02        SelectionVectorRemover
00-03          Limit(fetch=[1])
00-04            Project(EXPR$0=[/(CastHigh(-($2, /(*($3, $3), $4))), -($4, 
1))])
00-05              Window(window#0=[window(partition {} order by [] range 
between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING aggs [SUM($1), SUM($0), 
COUNT($0)])])
00-06                Project(c_integer=[$0], $1=[*($0, $0)])
00-07                  Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=maprfs:///drill/testdata/subqueries/j1]], 
selectionRoot=maprfs:/drill/testdata/subqueries/j1, numFiles=1, 
columns=[`c_integer`]]])
{code}

Looks like we already CastHigh all the arguments in the rewritten expression:
{code}
0: jdbc:drill:schema=dfs> explain plan for select VAR_SAMP(c_integer) from j1 
limit 1;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(EXPR$0=[$0])
00-02        SelectionVectorRemover
00-03          Limit(fetch=[1])
00-04            Project(EXPR$0=[CAST(/(-($0, /(*($1, $1), $2)), CASE(=($2, 1), 
null, -($2, 1)))):ANY NOT NULL])
00-05              StreamAgg(group=[{}], agg#0=[SUM($1)], agg#1=[SUM($0)], 
agg#2=[COUNT($0)])
00-06                Project(c_integer=[CastHigh($0)], $f1=[*(CastHigh($0), 
CastHigh($0))])
00-07                  Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=maprfs:///drill/testdata/subqueries/j1]], 
selectionRoot=maprfs:/drill/testdata/subqueries/j1, numFiles=1, 
columns=[`c_integer`]]])
{code}


> DrillAvgVarianceConvertlet may produce wrong results while rewriting stddev, 
> variance
> -------------------------------------------------------------------------------------
>
>                 Key: DRILL-3429
>                 URL: https://issues.apache.org/jira/browse/DRILL-3429
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Mehant Baid
>            Assignee: Mehant Baid
>             Fix For: 1.2.0
>
>
> DrillAvgVarianceConvertlet currently rewrites aggregate functions like avg, 
> stddev, variance to simple computations. 
> Eg: 
> Stddev( x ) => power(
>          (sum(x * x) - sum( x ) * sum( x ) / count( x ))
>          / count( x ),
>          .5)
> Consider the case when the input is an integer. Now the rewrite contains 
> multiplication and division, which will bind to functions that operate on 
> integers however the expected result should be a double and since double has 
> more precision than integer we should be operating on double during the 
> multiplication and division.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (DRILL-3429) DrillAvgVarianceConvertlet may produce wrong results while rewriting stddev, variance

Reply via email to