[jira] [Commented] (DRILL-2092) Incorrect result with count distinct and sum aggregates

Aman Sinha (JIRA) Thu, 29 Jan 2015 09:02:10 -0800

    [ https://issues.apache.org/jira/browse/DRILL-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297132#comment-14297132 ]


Aman Sinha commented on DRILL-2092:
-----------------------------------

Drill does not implement the 'IS_NOT_DISTINCT_FROM' operation, which is the 
join condition here (see the Explain plan in the issue description). The 
condition appears to fall through the other checks for non-equality joins, so 
the hash join operator assumes it is an equality join. Implementations of both 
IS_DISTINCT_FROM and IS_NOT_DISTINCT_FROM would be needed to ensure that null 
comparisons are handled correctly.
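
To illustrate the point, here is a minimal Python sketch (not Drill code) of 
how the null group disappears when an IS NOT DISTINCT FROM condition is 
evaluated as plain SQL equality. All function names are hypothetical, and a 
nested loop stands in for the hash join; the sketch assumes the join compares 
keys with SQL '=' semantics, where a comparison involving NULL never matches.

```python
# Illustrative sketch only; none of these names come from Drill.
rows = [10, 20, 20, 30, None]  # the a1 column from test.json


def sum_by_group(values):
    """One join input: SUM(a1) grouped by a1 (SUM ignores nulls)."""
    out = {}
    for v in values:
        if v is None:
            out.setdefault(None, None)  # null group, SUM is null
        else:
            out[v] = (out.get(v) or 0) + v
    return out


def count_distinct_by_group(values):
    """Other join input: COUNT(a1) over the distinct a1 values."""
    return {v: (0 if v is None else 1) for v in set(values)}


def sql_equals(a, b):
    # SQL '=': a comparison involving NULL is never true.
    return a is not None and b is not None and a == b


def is_not_distinct_from(a, b):
    # Null-safe equality: two NULLs are treated as matching.
    if a is None or b is None:
        return a is None and b is None
    return a == b


def inner_join(left, right, matches):
    """Nested-loop inner join returning (a1, count_distinct, sum) rows."""
    return [(lk, rv, lv)
            for lk, lv in left.items()
            for rk, rv in right.items()
            if matches(lk, rk)]


left = sum_by_group(rows)
right = count_distinct_by_group(rows)

as_equality = inner_join(left, right, sql_equals)          # null group lost
null_safe = inner_join(left, right, is_not_distinct_from)  # null group kept

print(len(as_equality), len(null_safe))
```

Under these assumptions the equality version returns three rows, matching the 
wrong result in the report, while the null-safe version also returns the 
(null, 0, null) row.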

> Incorrect result with count distinct and sum aggregates
> -------------------------------------------------------
>
>                 Key: DRILL-2092
>                 URL: https://issues.apache.org/jira/browse/DRILL-2092
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 0.8.0
>            Reporter: Victoria Markman
>            Assignee: Jinfeng Ni
>            Priority: Critical
>
> test.json
> {code}
> { "a1" : 10 , "b1" : 10 }
> { "a1" : 20 , "b1" : 20 }
> { "a1" : 20 , "b1" : 20}
> { "a1" : 30 , "b1" : 30 }
> { "a1" : null , "b1": null}
> {code}
> {code}
> 0: jdbc:drill:schema=dfs> select a1, count(distinct a1) from `test.json` 
> group by a1;
> +------------+------------+
> |     a1     |   EXPR$1   |
> +------------+------------+
> | 10         | 1          |
> | 20         | 1          |
> | 30         | 1          |
> | null       | 0          |
> +------------+------------+
> 4 rows selected (0.096 seconds)
> {code}
> If I add sum on the same column, I get the wrong result (the null group is gone):
> {code}
> 0: jdbc:drill:schema=dfs> select a1, count(distinct a1), sum(a1) from 
> `test.json` group by a1;
> +------------+------------+------------+
> |     a1     |   EXPR$1   |   EXPR$2   |
> +------------+------------+------------+
> | 10         | 1          | 10         |
> | 20         | 1          | 40         |
> | 30         | 1          | 30         |
> +------------+------------+------------+
> 3 rows selected (0.137 seconds)
> {code}
> Non-distinct count works correctly:
> {code}
> 0: jdbc:drill:schema=dfs> select a1, count(a1), sum(a1) from `test.json` 
> group by a1;
> +------------+------------+------------+
> |     a1     |   EXPR$1   |   EXPR$2   |
> +------------+------------+------------+
> | 10         | 1          | 10         |
> | 20         | 2          | 40         |
> | 30         | 1          | 30         |
> | null       | 0          | null       |
> +------------+------------+------------+
> 4 rows selected (0.187 seconds)
> {code}
> Plan for the query with the wrong result:
> {code}
> 00-01      Project(a1=[$0], EXPR$1=[$1], EXPR$2=[$2])
> 00-02        Project(a1=[$0], EXPR$1=[$3], EXPR$2=[$1])
> 00-03          HashJoin(condition=[IS NOT DISTINCT FROM($0, $2)], joinType=[inner])
> 00-05            HashAgg(group=[{0}], EXPR$2=[SUM($0)])
> 00-07              Scan(groupscan=[EasyGroupScan [selectionRoot=/test.json, numFiles=1, columns=[`a1`], files=[maprfs:/test.json]]])
> 00-04            Project(a10=[$0], EXPR$1=[$1])
> 00-06              HashAgg(group=[{0}], EXPR$1=[COUNT($0)])
> 00-08                HashAgg(group=[{0}])
> 00-09                  Scan(groupscan=[EasyGroupScan [selectionRoot=/test.json, numFiles=1, columns=[`a1`], files=[maprfs:/test.json]]])
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
