[
https://issues.apache.org/jira/browse/DRILL-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936637#comment-15936637
]
Arina Ielchiieva commented on DRILL-5375:
-----------------------------------------
Based on my research (mailing lists, Jiras), mostly users set
planner.enable_nljoin_for_scalar_only=false to allow cross joins and
non-equality joins as Drill currently does not support any of them which is de
facto a workaround. Nested loop join currently works as follows: does cross
join and applies filler with condition (since filters (where clause) in Drill
allows non-equality filtering), basically re-writes the query. This works fine
when users have inner join or need cross join but not for left join. There is
no other way to re-write the query to push non-equi join conditions to filter
for the left join unless do self-join as was mentioned in comments (sure, not
optimal, so is not considered).
Nested loop join ignores any conditions or join types and just does cross join
unlike hash join or merge join but they only allow equi-joins. So one thing is
to teach nested loop to take into account join type and conditions as other
joins (hash and merge) but it's won't solve the problem with non-equi join
conditions. Another thing is to implement non-equi join support for nested loop
joins which has wider scope since currently Drill join implementation supports
only equi-join logic.
So the scope may look like the following:
1. Discard cross join and filtering logic for nested loop.
2. Implement support for join type and condition during nested loop join.
3. Implement non-equi join support for nested loop joins.
> Nested loop join: return correct result for left join
> -----------------------------------------------------
>
> Key: DRILL-5375
> URL: https://issues.apache.org/jira/browse/DRILL-5375
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.8.0
> Reporter: Arina Ielchiieva
> Assignee: Arina Ielchiieva
>
> Mini repro:
> 1. Create 2 Hive tables with data
> {code}
> CREATE TABLE t1 (
> FYQ varchar(999),
> dts varchar(999),
> dte varchar(999)
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> 2016-Q1,2016-06-01,2016-09-30
> 2016-Q2,2016-09-01,2016-12-31
> 2016-Q3,2017-01-01,2017-03-31
> 2016-Q4,2017-04-01,2017-06-30
> CREATE TABLE t2 (
> who varchar(999),
> event varchar(999),
> dt varchar(999)
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> aperson,did somthing,2017-01-06
> aperson,did somthing else,2017-01-12
> aperson,had chrsitmas,2016-12-26
> aperson,went wild,2016-01-01
> {code}
> 2. Impala Query shows correct result
> {code}
> select t2.dt, t1.fyq, t2.who, t2.event
> from t2
> left join t1 on t2.dt between t1.dts and t1.dte
> order by t2.dt;
> +------------+---------+---------+-------------------+
> | dt | fyq | who | event |
> +------------+---------+---------+-------------------+
> | 2016-01-01 | NULL | aperson | went wild |
> | 2016-12-26 | 2016-Q2 | aperson | had chrsitmas |
> | 2017-01-06 | 2016-Q3 | aperson | did somthing |
> | 2017-01-12 | 2016-Q3 | aperson | did somthing else |
> +------------+---------+---------+-------------------+
> {code}
> 3. Drill query shows wrong results:
> {code}
> alter session set planner.enable_nljoin_for_scalar_only=false;
> use hive;
> select t2.dt, t1.fyq, t2.who, t2.event
> from t2
> left join t1 on t2.dt between t1.dts and t1.dte
> order by t2.dt;
> +-------------+----------+----------+--------------------+
> | dt | fyq | who | event |
> +-------------+----------+----------+--------------------+
> | 2016-12-26 | 2016-Q2 | aperson | had chrsitmas |
> | 2017-01-06 | 2016-Q3 | aperson | did somthing |
> | 2017-01-12 | 2016-Q3 | aperson | did somthing else |
> +-------------+----------+----------+--------------------+
> 3 rows selected (2.523 seconds)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)