[
https://issues.apache.org/jira/browse/DRILL-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aman Sinha updated DRILL-2010:
------------------------------
Fix Version/s: 0.9.0
> merge join returns wrong number of rows with large dataset
> ----------------------------------------------------------
>
> Key: DRILL-2010
> URL: https://issues.apache.org/jira/browse/DRILL-2010
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Relational Operators
> Affects Versions: 0.8.0
> Reporter: Chun Chang
> Assignee: Aman Sinha
> Priority: Critical
> Fix For: 0.9.0
>
>
> #Mon Jan 12 18:19:31 EST 2015
> git.commit.id.abbrev=5b012bf
> When the data set is large enough (e.g., larger than one batch), merge join
> does not return the correct number of rows, while hash join does. The data
> can be downloaded from:
> https://s3.amazonaws.com/apache-drill/files/complex100k.json.gz
> With this dataset, the following query should return 10,000,000.
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set
> `planner.enable_mergejoin` = true;
> +------------+------------+
> | ok | summary |
> +------------+------------+
> | true | planner.enable_mergejoin updated. |
> +------------+------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set
> `planner.enable_hashjoin` = false;
> +------------+------------+
> | ok | summary |
> +------------+------------+
> | true | planner.enable_hashjoin updated. |
> +------------+------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select count(a.id) from
> `complex100k.json` a inner join `complex100k.json` b on a.gbyi=b.gbyi;
> +------------+
> | EXPR$0 |
> +------------+
> | 9046760 |
> +------------+
> 1 row selected (6.205 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set
> `planner.enable_mergejoin` = false;
> +------------+------------+
> | ok | summary |
> +------------+------------+
> | true | planner.enable_mergejoin updated. |
> +------------+------------+
> 1 row selected (0.026 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set
> `planner.enable_hashjoin` = true;
> +------------+------------+
> | ok | summary |
> +------------+------------+
> | true | planner.enable_hashjoin updated. |
> +------------+------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select count(a.id) from
> `complex100k.json` a inner join `complex100k.json` b on a.gbyi=b.gbyi;
> +------------+
> | EXPR$0 |
> +------------+
> | 10000000 |
> +------------+
> 1 row selected (4.453 seconds)
> {code}
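> As a cross-check that 10,000,000 is the right answer, the expected inner-join
> cardinality on `gbyi` can be computed without any join at all, by summing the
> squared per-key row counts (a sketch only; it assumes the same
> `complex100k.json` file registered in the dfs.drillTestDirComplexJ schema):
> {code}
> -- Sketch: an inner self-join on gbyi produces sum over keys of (rows per key)^2.
> select sum(cnt * cnt) as expected_rows
> from (select gbyi, count(*) as cnt
>       from `complex100k.json`
>       group by gbyi) t;
> {code}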
> With a smaller dataset, both merge join and hash join return the same
> correct count.
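> A quick way to confirm the size dependence is to materialize a small slice of
> the file into a writable workspace and repeat the join against it (a sketch;
> assumes a writable dfs.tmp workspace and a hypothetical table name):
> {code}
> -- Sketch: build a small copy that should fit in a single batch, then rerun the join.
> create table dfs.tmp.complex_small as
>   select * from `complex100k.json` limit 1000;
> select count(a.id) from dfs.tmp.complex_small a
>   inner join dfs.tmp.complex_small b on a.gbyi = b.gbyi;
> {code}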
> Physical plan for merge join:
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select
> count(a.id) from `complex100k.json` a inner join `complex100k.json` b on
> a.gbyi=b.gbyi;
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00 Screen
> 00-01 StreamAgg(group=[{}], EXPR$0=[COUNT($0)])
> 00-02 Project(id=[$1])
> 00-03 MergeJoin(condition=[=($0, $2)], joinType=[inner])
> 00-05 SelectionVectorRemover
> 00-07 Sort(sort0=[$0], dir0=[ASC])
> 00-09 Scan(groupscan=[EasyGroupScan
> [selectionRoot=/drill/testdata/complex_type/json/complex100k.json,
> numFiles=1, columns=[`gbyi`, `id`],
> files=[maprfs:/drill/testdata/complex_type/json/complex100k.json]]])
> 00-04 Project(gbyi0=[$0])
> 00-06 SelectionVectorRemover
> 00-08 Sort(sort0=[$0], dir0=[ASC])
> 00-10 Scan(groupscan=[EasyGroupScan
> [selectionRoot=/drill/testdata/complex_type/json/complex100k.json,
> numFiles=1, columns=[`gbyi`],
> files=[maprfs:/drill/testdata/complex_type/json/complex100k.json]]])
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)