[jira] [Updated] (DRILL-2242) Wrong result (more rows) when outer query groups by subset of columns that inner query groups by

Aman Sinha (JIRA) Tue, 17 Feb 2015 14:02:24 -0800

     [ 
https://issues.apache.org/jira/browse/DRILL-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Aman Sinha updated DRILL-2242:
------------------------------
    Assignee: Jinfeng Ni  (was: Aman Sinha)

> Wrong result (more rows) when outer query groups by subset of columns that 
> inner query groups by
> ------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-2242
>                 URL: https://issues.apache.org/jira/browse/DRILL-2242
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 0.7.0
>            Reporter: Aman Sinha
>            Assignee: Jinfeng Ni
>            Priority: Critical
>
> The following query has a subquery that groups on 2 columns and outer query 
> that queries on 1 of those columns.  With slice_target = 1 to force 
> exchanges, it produces incorrect result: 
> {code}
> alter session set `planner.slice_target` = 1;
> select count(*) from 
>  (select l_partksy from
>    (select l_partkey, l_suppkey from cp.`tpch/lineitem.parquet`
>       group by l_partkey, l_suppkey) 
>    group by l_partkey
>  );
> +------------+
> |   EXPR$0   |
> +------------+
> | 6227       |
> +------------+
> 1 row selected (1.522 seconds)
> {code}
> Correct result (from Postgres): 
> {code}
>  count
> -------
>   2000
> (1 row)
> {code}
> The cause appears to be related to distribution trait propagation.  Here's 
> the EXPLAIN plan: 
> {code}
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      StreamAgg(group=[{}], EXPR$0=[$SUM0($0)])
> 00-02        UnionExchange
> 01-01          StreamAgg(group=[{}], EXPR$0=[COUNT()])
> 01-02            Project($f0=[0])
> 01-03              HashAgg(group=[{0}])
> 01-04                Project(l_partkey=[$0])
> 01-05                  HashAgg(group=[{0, 1}])
> 01-06                    HashToRandomExchange(dist0=[[$0]], dist1=[[$1]])
> 02-01                      HashAgg(group=[{0, 1}])
> 02-02                        Project(l_partkey=[$1], l_suppkey=[$0])
> 02-03                          Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=/tpch/lineitem.parquet]], 
> selectionRoot=/tpch/lineitem.parquet, numFiles=1, columns=[`l_partkey`, 
> `l_suppkey`]]])
> {code}
> Note that the HashExchange operator 06 does a distribute on 2 columns 
> l_partkey and l_suppkey in order to perform the 2phase aggregation. These are 
> the group-by columns.  However, in the outer query's HashAgg, there is no 
> re-distribution being done.  It assumes that data is already hash distributed 
> on l_partkey which is not correct. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (DRILL-2242) Wrong result (more rows) when outer query groups by subset of columns that inner query groups by

Reply via email to