[
https://issues.apache.org/jira/browse/DRILL-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aman Sinha updated DRILL-2242:
------------------------------
Assignee: Jinfeng Ni (was: Aman Sinha)
> Wrong result (more rows) when outer query groups by subset of columns that
> inner query groups by
> ------------------------------------------------------------------------------------------------
>
> Key: DRILL-2242
> URL: https://issues.apache.org/jira/browse/DRILL-2242
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Affects Versions: 0.7.0
> Reporter: Aman Sinha
> Assignee: Jinfeng Ni
> Priority: Critical
>
> The following query has a subquery that groups on 2 columns and outer query
> that queries on 1 of those columns. With slice_target = 1 to force
> exchanges, it produces incorrect result:
> {code}
> alter session set `planner.slice_target` = 1;
> select count(*) from
> (select l_partksy from
> (select l_partkey, l_suppkey from cp.`tpch/lineitem.parquet`
> group by l_partkey, l_suppkey)
> group by l_partkey
> );
> +------------+
> | EXPR$0 |
> +------------+
> | 6227 |
> +------------+
> 1 row selected (1.522 seconds)
> {code}
> Correct result (from Postgres):
> {code}
> count
> -------
> 2000
> (1 row)
> {code}
> The cause appears to be related to distribution trait propagation. Here's
> the EXPLAIN plan:
> {code}
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00 Screen
> 00-01 StreamAgg(group=[{}], EXPR$0=[$SUM0($0)])
> 00-02 UnionExchange
> 01-01 StreamAgg(group=[{}], EXPR$0=[COUNT()])
> 01-02 Project($f0=[0])
> 01-03 HashAgg(group=[{0}])
> 01-04 Project(l_partkey=[$0])
> 01-05 HashAgg(group=[{0, 1}])
> 01-06 HashToRandomExchange(dist0=[[$0]], dist1=[[$1]])
> 02-01 HashAgg(group=[{0, 1}])
> 02-02 Project(l_partkey=[$1], l_suppkey=[$0])
> 02-03 Scan(groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath [path=/tpch/lineitem.parquet]],
> selectionRoot=/tpch/lineitem.parquet, numFiles=1, columns=[`l_partkey`,
> `l_suppkey`]]])
> {code}
> Note that the HashExchange operator 06 does a distribute on 2 columns
> l_partkey and l_suppkey in order to perform the 2phase aggregation. These are
> the group-by columns. However, in the outer query's HashAgg, there is no
> re-distribution being done. It assumes that data is already hash distributed
> on l_partkey which is not correct.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)