[
https://issues.apache.org/jira/browse/DRILL-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15647945#comment-15647945
]
Zelaine Fong commented on DRILL-5010:
-------------------------------------
I'm not sure why the use of merge join vs hash join is a bug. Whether one is
chosen over the other is a costing decision made by the optimizer.
> Equality join condition is treated as a MergeJoin and not as a HashJoin.
> ------------------------------------------------------------------------
>
> Key: DRILL-5010
> URL: https://issues.apache.org/jira/browse/DRILL-5010
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow
> Affects Versions: 1.9.0
> Reporter: Khurram Faraaz
>
> Equality join condition is treated as a MergeJoin and not as a HashJoin.
> Drill 1.9.0 git commit ID: 83513daf
> Projecting the join columns results in merge join, whereas it should be doing
> a HashJoin.
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> explain plan for
> . . . . . . . . . . . . . . > select t1.intKey, t2.intKey from `left.json`
> t1, `right_all_nulls.json` t2
> WHERE t1.intKey = t2.intKey OR ( t1.intKey IS NULL AND t2.intKey IS NULL);
> +------+------+
> | text | json |
> +------+------+
> | 00-00 Screen
> 00-01 Project(intKey=[$0], intKey0=[$1])
> 00-02 Project(intKey=[$0], intKey0=[$1])
> 00-03 MergeJoin(condition=[IS NOT DISTINCT FROM($0, $1)],
> joinType=[inner])
> 00-05 SelectionVectorRemover
> 00-07 Sort(sort0=[$0], dir0=[ASC])
> 00-09 Scan(groupscan=[EasyGroupScan
> [selectionRoot=maprfs:/tmp/left.json, numFiles=1, columns=[`intKey`],
> files=[maprfs:///tmp/left.json]]])
> 00-04 Project(intKey0=[$0])
> 00-06 SelectionVectorRemover
> 00-08 Sort(sort0=[$0], dir0=[ASC])
> 00-10 Scan(groupscan=[EasyGroupScan
> [selectionRoot=maprfs:/tmp/right_all_nulls.json, numFiles=1,
> columns=[`intKey`], files=[maprfs:///tmp/right_all_nulls.json]]])
> {noformat}
> Note that HashAgg and HashJoin were enabled.
> {noformat}
> | planner.enable_hashagg | BOOLEAN | SYSTEM | DEFAULT
> | null | null | true | null |
> | planner.enable_hashjoin | BOOLEAN | SYSTEM | DEFAULT
> | null | null | true | null |
> {noformat}
> Doing a SELECT <star> results in a HashJoin in the query plan
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> explain plan for
> . . . . . . . . . . . . . . > select * from `left.json` t1,
> `right_all_nulls.json` t2
> WHERE t1.intKey = t2.intKey OR ( t1.intKey IS NULL AND t2.intKey IS NULL);
> +------+------+
> | text | json |
> +------+------+
> | 00-00 Screen
> 00-01 ProjectAllowDup(*=[$0], *0=[$1])
> 00-02 Project(T46¦¦*=[$0], T47¦¦*=[$2])
> 00-03 HashJoin(condition=[IS NOT DISTINCT FROM($1, $3)],
> joinType=[inner])
> 00-04 Project(T47¦¦*=[$0], intKey0=[$1])
> 00-06 Project(T47¦¦*=[$0], intKey=[$1])
> 00-08 Scan(groupscan=[EasyGroupScan
> [selectionRoot=maprfs:/tmp/right_all_nulls.json, numFiles=1, columns=[`*`],
> files=[maprfs:///tmp/right_all_nulls.json]]])
> 00-05 Project(T46¦¦*=[$0], intKey=[$1])
> 00-07 Scan(groupscan=[EasyGroupScan
> [selectionRoot=maprfs:/tmp/left.json, numFiles=1, columns=[`*`],
> files=[maprfs:///tmp/left.json]]])
> {noformat}
> Data used in above queries
> {noformat}
> [root@centos-01 null_eq_joins]# cat left.json
> {
> "intKey" : 123,
> "bgintKey": 1234567,
> "strKey": "this is a test string",
> "boolKey": true,
> "fltKey": 123.786,
> "dblKey": 457.984,
> "timKey": "18:30:45",
> "dtKey": "1997-10-21",
> "tmstmpKey": "2007-04-30 13:10:02.047",
> "intrvldyKey": "P9DT38833S",
> "intrvlyrKey": "P255M"
> }
> [root@centos-01 null_equality_joins]#
> [root@centos-01 null_eq_joins]# cat right_all_nulls.json
> {
> "intKey" : null,
> "bgintKey": null,
> "strKey": null,
> "boolKey": null,
> "fltKey": null,
> "dblKey": null,
> "timKey": null,
> "dtKey": null,
> "tmstmpKey": null,
> "intrvldyKey": null,
> "intrvlyrKey": null
> }
> [root@centos-01 null_eq_joins]#
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)