[ 
https://issues.apache.org/jira/browse/DRILL-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366694#comment-14366694
 ] 

Aman Sinha commented on DRILL-2309:
-----------------------------------

+1.  A couple of suggestions: is the ordinal guaranteed to be less than 
inputExprs.size()?  Perhaps add an assert to be safe, since the ordinal 
comes from the aggregate call.  
For testing, it would be good to add a case where count() and avg() are 
computed on the same column.  
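The suggested guard could look like the sketch below. This is a hypothetical standalone illustration, not Drill's actual code; the names exprForOrdinal and the String element type are assumptions, while inputExprs and ordinal come from the comment above.

```java
import java.util.Arrays;
import java.util.List;

public class OrdinalCheck {
    // Hypothetical helper illustrating the suggested guard: the ordinal
    // coming from the aggregate call must be a valid index into inputExprs.
    static String exprForOrdinal(List<String> inputExprs, int ordinal) {
        assert ordinal >= 0 && ordinal < inputExprs.size()
            : "ordinal " + ordinal + " out of range for "
              + inputExprs.size() + " input exprs";
        return inputExprs.get(ordinal);
    }

    public static void main(String[] args) {
        List<String> exprs = Arrays.asList("gbyi", "id", "fl", "nul");
        System.out.println(exprForOrdinal(exprs, 3)); // prints "nul"
    }
}
```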

> Selecting count(), avg() of nullable columns causes wrong results
> -----------------------------------------------------------------
>
>                 Key: DRILL-2309
>                 URL: https://issues.apache.org/jira/browse/DRILL-2309
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Data Types
>    Affects Versions: 0.8.0
>            Reporter: Chun Chang
>            Assignee: Mehant Baid
>            Priority: Critical
>             Fix For: 0.9.0
>
>         Attachments: DRILL-2309.patch
>
>
> #Thu Feb 19 18:40:10 EST 2015
> git.commit.id.abbrev=1ceddff
> The following query returns the correct count for columns that contain 
> null values.
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select tt.gbyi, count(tt.nul) 
> from (select t.id, t.gbyi, t.fl, t.nul from `complex.json` t) tt group by 
> tt.gbyi order by tt.gbyi;
> +------------+------------+
> |    gbyi    |   EXPR$1   |
> +------------+------------+
> | 0          | 33580      |
> | 1          | 33317      |
> | 2          | 33438      |
> | 3          | 33535      |
> | 4          | 33369      |
> | 5          | 32990      |
> | 6          | 33661      |
> | 7          | 33130      |
> | 8          | 33362      |
> | 9          | 33364      |
> | 10         | 33229      |
> | 11         | 33567      |
> | 12         | 33379      |
> | 13         | 33045      |
> | 14         | 33305      |
> +------------+------------+
> {code}
> But if you add more aggregates to the query, the returned count is wrong 
> (pay attention to the last column). 
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select tt.gbyi, sum(tt.id), 
> avg(tt.fl), count(tt.nul) from (select t.id, t.gbyi, t.fl, t.nul from 
> `complex.json` t) tt group by tt.gbyi order by tt.gbyi;
> +------------+------------+------------+------------+
> |    gbyi    |   EXPR$1   |   EXPR$2   |   EXPR$3   |
> +------------+------------+------------+------------+
> | 0          | 33445554017 | 499613.0956877819 | 66943      |
> | 1          | 33209358334 | 500760.0252919893 | 66318      |
> | 2          | 33369118041 | 498091.82200273 | 66994      |
> | 3          | 33254533860 | 498696.5063226428 | 66683      |
> | 4          | 33393965595 | 501125.64656145993 | 66638      |
> | 5          | 33216885506 | 499961.32710397616 | 66439      |
> | 6          | 33380205950 | 498875.3923256599 | 66911      |
> | 7          | 33405849390 | 501093.43067788356 | 66666      |
> | 8          | 33136951190 | 498458.1044031481 | 66479      |
> | 9          | 33319291474 | 499967.5392457864 | 66643      |
> | 10         | 33339388887 | 499190.47462408233 | 66787      |
> | 11         | 33571590550 | 502095.86682194035 | 66863      |
> | 12         | 33437342090 | 501708.8141502653 | 66647      |
> | 13         | 33071800925 | 498896.453904129 | 66290      |
> | 14         | 33448664191 | 501487.4206955959 | 66699      |
> +------------+------------+------------+------------+
> {code}
> Plan for the query that returned the wrong result:
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select 
> tt.gbyi, sum(tt.id), avg(tt.fl), count(tt.nul) from (select t.id, t.gbyi, 
> t.fl, t.nul from `complex.json` t) tt group by tt.gbyi order by tt.gbyi;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(gbyi=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
> 00-02        SingleMergeExchange(sort0=[0 ASC])
> 01-01          SelectionVectorRemover
> 01-02            Sort(sort0=[$0], dir0=[ASC])
> 01-03              Project(gbyi=[$0], EXPR$1=[CASE(=($2, 0), null, $1)], 
> EXPR$2=[CAST(/(CastHigh(CASE(=($4, 0), null, $3)), $4)):ANY], EXPR$3=[$5])
> 01-04                HashAgg(group=[{0}], agg#0=[$SUM0($1)], 
> agg#1=[$SUM0($2)], agg#2=[$SUM0($3)], agg#3=[$SUM0($4)], EXPR$3=[$SUM0($5)])
> 01-05                  HashToRandomExchange(dist0=[[$0]])
> 02-01                    HashAgg(group=[{0}], agg#0=[$SUM0($1)], 
> agg#1=[COUNT($1)], agg#2=[$SUM0($2)], agg#3=[COUNT($2)], EXPR$3=[COUNT()])
> 02-02                      Project(gbyi=[$3], id=[$2], fl=[$1], nul=[$0])
> 02-03                        Scan(groupscan=[EasyGroupScan 
> [selectionRoot=/drill/testdata/complex_type/json/complex.json, numFiles=1, 
> columns=[`gbyi`, `id`, `fl`, `nul`], 
> files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
> {code}
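Note that in the plan above the lower HashAgg computes EXPR$3=[COUNT()], i.e. a count of all rows, rather than a count of the nullable column itself. A minimal sketch of why that distinction matters on nullable data (plain Java standing in for SQL semantics; the sample values are made up):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class CountSemantics {
    public static void main(String[] args) {
        // A nullable column: COUNT() counts every row, COUNT(col) skips nulls.
        List<Integer> nul = Arrays.asList(1, null, 3, null);
        long countStar = nul.size();                                    // COUNT():    4
        long countCol  = nul.stream().filter(Objects::nonNull).count(); // COUNT(nul): 2
        System.out.println(countStar + " " + countCol); // prints "4 2"
    }
}
```

When the planner emits COUNT() where COUNT(tt.nul) was asked for, every null row is counted as well, which matches the inflated EXPR$3 values in the second result set.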



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
