[
https://issues.apache.org/jira/browse/DRILL-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337236#comment-14337236
]
Chun Chang commented on DRILL-2309:
-----------------------------------
I think this issue might be caused by reusing a count value. If another
aggregate function has already generated a count (AVG expands into SUM/COUNT),
we may reuse it as an optimization. But that reuse does not take into account
that the counted column might contain nulls, and COUNT(column) must not count
nulls. The plan in the description bears this out: in the 02-01 HashAgg,
count(tt.nul) is planned as a plain row count, EXPR$3=[COUNT()], instead of a
count over the nul column.
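As a reminder of the semantics at stake, here is a minimal standalone
illustration in plain standard SQL (a hypothetical toy relation, not this
dataset): COUNT(c) must skip nulls while COUNT(*) counts every row.
{code}
-- Toy relation with one null out of three rows.
SELECT COUNT(c) AS cnt_col,   -- 2: nulls are excluded
       COUNT(*) AS cnt_star   -- 3: every row is counted
FROM (VALUES (1), (NULL), (2)) AS t(c);
{code}
The two queries below show the symptom on the actual data: the first (count
plus avg) returns the inflated counts, while the second (count plus sum, no
avg) returns the correct ones.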
{code}
0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select tt.gbyi, count(tt.nul),
avg(tt.fl) from (select t.id, t.gbyi, t.fl, t.nul from `complex.json` t) tt
group by tt.gbyi order by tt.gbyi;
+------------+------------+------------+
| gbyi | EXPR$1 | EXPR$2 |
+------------+------------+------------+
| 0 | 66943 | 499613.0956877819 |
| 1 | 66318 | 500760.0252919893 |
| 2 | 66994 | 498091.82200273 |
| 3 | 66683 | 498696.5063226428 |
| 4 | 66638 | 501125.64656145993 |
| 5 | 66439 | 499961.32710397616 |
| 6 | 66911 | 498875.3923256599 |
| 7 | 66666 | 501093.43067788356 |
| 8 | 66479 | 498458.1044031481 |
| 9 | 66643 | 499967.5392457864 |
| 10 | 66787 | 499190.47462408233 |
| 11 | 66863 | 502095.86682194035 |
| 12 | 66647 | 501708.8141502653 |
| 13 | 66290 | 498896.453904129 |
| 14 | 66699 | 501487.4206955959 |
+------------+------------+------------+
15 rows selected (14.17 seconds)
0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select tt.gbyi, sum(tt.id),
count(tt.nul) from (select t.id, t.gbyi, t.fl, t.nul from `complex.json` t) tt
group by tt.gbyi order by tt.gbyi;
+------------+------------+------------+
| gbyi | EXPR$1 | EXPR$2 |
+------------+------------+------------+
| 0 | 33445554017 | 33580 |
| 1 | 33209358334 | 33317 |
| 2 | 33369118041 | 33438 |
| 3 | 33254533860 | 33535 |
| 4 | 33393965595 | 33369 |
| 5 | 33216885506 | 32990 |
| 6 | 33380205950 | 33661 |
| 7 | 33405849390 | 33130 |
| 8 | 33136951190 | 33362 |
| 9 | 33319291474 | 33364 |
| 10 | 33339388887 | 33229 |
| 11 | 33571590550 | 33567 |
| 12 | 33437342090 | 33379 |
| 13 | 33071800925 | 33045 |
| 14 | 33448664191 | 33305 |
+------------+------------+------------+
15 rows selected (13.909 seconds)
{code}
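Until this is fixed, a possible workaround (a sketch only, untested on this
build) is to spell the null-aware count out explicitly, so the planner has
nothing to collapse into a plain row count:
{code}
-- Hypothetical rewrite of the failing query: count non-null values via a
-- CASE expression instead of count(tt.nul).
select tt.gbyi, avg(tt.fl),
       sum(case when tt.nul is not null then 1 else 0 end) as nul_cnt
from (select t.id, t.gbyi, t.fl, t.nul from `complex.json` t) tt
group by tt.gbyi order by tt.gbyi;
{code}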
> 'null' is counted with subquery
> -------------------------------
>
> Key: DRILL-2309
> URL: https://issues.apache.org/jira/browse/DRILL-2309
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Data Types
> Affects Versions: 0.8.0
> Reporter: Chun Chang
> Assignee: Daniel Barclay (Drill)
> Priority: Critical
>
> #Thu Feb 19 18:40:10 EST 2015
> git.commit.id.abbrev=1ceddff
> The following query returns the correct count for a column that contains
> null values.
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select tt.gbyi, count(tt.nul)
> from (select t.id, t.gbyi, t.fl, t.nul from `complex.json` t) tt group by
> tt.gbyi order by tt.gbyi;
> +------------+------------+
> | gbyi | EXPR$1 |
> +------------+------------+
> | 0 | 33580 |
> | 1 | 33317 |
> | 2 | 33438 |
> | 3 | 33535 |
> | 4 | 33369 |
> | 5 | 32990 |
> | 6 | 33661 |
> | 7 | 33130 |
> | 8 | 33362 |
> | 9 | 33364 |
> | 10 | 33229 |
> | 11 | 33567 |
> | 12 | 33379 |
> | 13 | 33045 |
> | 14 | 33305 |
> +------------+------------+
> {code}
> But if you add more aggregate functions to the query, the returned count is
> wrong (pay attention to the last column).
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select tt.gbyi, sum(tt.id),
> avg(tt.fl), count(tt.nul) from (select t.id, t.gbyi, t.fl, t.nul from
> `complex.json` t) tt group by tt.gbyi order by tt.gbyi;
> +------------+------------+------------+------------+
> | gbyi | EXPR$1 | EXPR$2 | EXPR$3 |
> +------------+------------+------------+------------+
> | 0 | 33445554017 | 499613.0956877819 | 66943 |
> | 1 | 33209358334 | 500760.0252919893 | 66318 |
> | 2 | 33369118041 | 498091.82200273 | 66994 |
> | 3 | 33254533860 | 498696.5063226428 | 66683 |
> | 4 | 33393965595 | 501125.64656145993 | 66638 |
> | 5 | 33216885506 | 499961.32710397616 | 66439 |
> | 6 | 33380205950 | 498875.3923256599 | 66911 |
> | 7 | 33405849390 | 501093.43067788356 | 66666 |
> | 8 | 33136951190 | 498458.1044031481 | 66479 |
> | 9 | 33319291474 | 499967.5392457864 | 66643 |
> | 10 | 33339388887 | 499190.47462408233 | 66787 |
> | 11 | 33571590550 | 502095.86682194035 | 66863 |
> | 12 | 33437342090 | 501708.8141502653 | 66647 |
> | 13 | 33071800925 | 498896.453904129 | 66290 |
> | 14 | 33448664191 | 501487.4206955959 | 66699 |
> +------------+------------+------------+------------+
> {code}
> Plan for the query that returned the wrong result:
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select
> tt.gbyi, sum(tt.id), avg(tt.fl), count(tt.nul) from (select t.id, t.gbyi,
> t.fl, t.nul from `complex.json` t) tt group by tt.gbyi order by tt.gbyi;
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00 Screen
> 00-01 Project(gbyi=[$0], EXPR$1=[$1], EXPR$2=[$2], EXPR$3=[$3])
> 00-02 SingleMergeExchange(sort0=[0 ASC])
> 01-01 SelectionVectorRemover
> 01-02 Sort(sort0=[$0], dir0=[ASC])
> 01-03 Project(gbyi=[$0], EXPR$1=[CASE(=($2, 0), null, $1)],
> EXPR$2=[CAST(/(CastHigh(CASE(=($4, 0), null, $3)), $4)):ANY], EXPR$3=[$5])
> 01-04 HashAgg(group=[{0}], agg#0=[$SUM0($1)],
> agg#1=[$SUM0($2)], agg#2=[$SUM0($3)], agg#3=[$SUM0($4)], EXPR$3=[$SUM0($5)])
> 01-05 HashToRandomExchange(dist0=[[$0]])
> 02-01 HashAgg(group=[{0}], agg#0=[$SUM0($1)],
> agg#1=[COUNT($1)], agg#2=[$SUM0($2)], agg#3=[COUNT($2)], EXPR$3=[COUNT()])
> 02-02 Project(gbyi=[$3], id=[$2], fl=[$1], nul=[$0])
> 02-03 Scan(groupscan=[EasyGroupScan
> [selectionRoot=/drill/testdata/complex_type/json/complex.json, numFiles=1,
> columns=[`gbyi`, `id`, `fl`, `nul`],
> files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
> {code}