[jira] [Commented] (DRILL-5375) Nested loop join: return correct result for left join

Arina Ielchiieva (JIRA) Thu, 23 Mar 2017 05:46:03 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15938213#comment-15938213
 ]


Arina Ielchiieva commented on DRILL-5375:
-----------------------------------------

[~zfong]
Actually, I have analyzed nested loop join and types of joins it can support 
and came to the conclusion that it should support INNER and LEFT joins only, as 
was done initially (before my changes). Basically, nested loop join is not good 
candidate for RIGHT or FULL joins because of its implementation specifics, 
planner won't pick nested loop join for such joins as it's not optimal. If we 
want support non-equi join with RIGHT and FULL joins then we need to add 
support for non-equi joins in hash and merge joins which are much better 
candidates for such types of joins.

The main idea of nested loop join is that it buffers data from right table 
(which should be small enough) and for each left table record checks if any 
right table record satisfies the join condition. Let's say we want try to allow 
RIGHT and FULL joins for nested loop join in Drill.
Pre-conditions:
2 drillbits (we would assume that join will be performed on two nodes)
2 tables:
T1
||c1||
|A|
|B|
....  any other letters except C
T2
||c1||
|A|
|B|
|C|

Query: select * from t1 right join t2 on t1.c1 =  t2.c1
Expected result:
||t1.c1||t2.c1||
|A|A|
|B|B|
|null|C|

Drill buffers T2 table on each node.
Drillbit_1 receives batch from T1 (let's imagine that out batches will contain 
only one row): *A*. It iterates over right input data and finds match: *A|A*. 
Also it marks that match for B, C was not found.
Drillbit_2 receives batch from T1: *B*. It iterates over right input data and 
finds match: *B|B*. Also it marks that match for A, C was not found.

Now to return correct RIGHT join output, we need to take statistics from two 
nodes (T2 rows that didn't find match) and merge it: *B, C + A, C => C* and 
additionally output *null|C*.
Presumably there will be one node that waits output from all nodes. 

This doesn't really coincide with Drill batch iteration approach and may be too 
tricky to implement. Thus as I have mentioned before, if we need RIGHT or FULL 
join to support non-equi joins, more correctly will be to add non-equi join 
support for hash and merge joins.




> Nested loop join: return correct result for left join
> -----------------------------------------------------
>
>                 Key: DRILL-5375
>                 URL: https://issues.apache.org/jira/browse/DRILL-5375
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>              Labels: doc-impacting
>
> Mini repro:
> 1. Create 2 Hive tables with data
> {code}
> CREATE TABLE t1 (
>   FYQ varchar(999),
>   dts varchar(999),
>   dte varchar(999)
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> 2016-Q1,2016-06-01,2016-09-30
> 2016-Q2,2016-09-01,2016-12-31
> 2016-Q3,2017-01-01,2017-03-31
> 2016-Q4,2017-04-01,2017-06-30
> CREATE TABLE t2 (
>   who varchar(999),
>   event varchar(999),
>   dt varchar(999)
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> aperson,did somthing,2017-01-06
> aperson,did somthing else,2017-01-12
> aperson,had chrsitmas,2016-12-26
> aperson,went wild,2016-01-01
> {code}
> 2. Impala Query shows correct result
> {code}
> select t2.dt, t1.fyq, t2.who, t2.event
> from t2
> left join t1 on t2.dt between t1.dts and t1.dte
> order by t2.dt;
> +------------+---------+---------+-------------------+
> | dt         | fyq     | who     | event             |
> +------------+---------+---------+-------------------+
> | 2016-01-01 | NULL    | aperson | went wild         |
> | 2016-12-26 | 2016-Q2 | aperson | had chrsitmas     |
> | 2017-01-06 | 2016-Q3 | aperson | did somthing      |
> | 2017-01-12 | 2016-Q3 | aperson | did somthing else |
> +------------+---------+---------+-------------------+
> {code}
> 3. Drill query shows wrong results:
> {code}
> alter session set planner.enable_nljoin_for_scalar_only=false;
> use hive;
> select t2.dt, t1.fyq, t2.who, t2.event
> from t2
> left join t1 on t2.dt between t1.dts and t1.dte
> order by t2.dt;
> +-------------+----------+----------+--------------------+
> |     dt      |   fyq    |   who    |       event        |
> +-------------+----------+----------+--------------------+
> | 2016-12-26  | 2016-Q2  | aperson  | had chrsitmas      |
> | 2017-01-06  | 2016-Q3  | aperson  | did somthing       |
> | 2017-01-12  | 2016-Q3  | aperson  | did somthing else  |
> +-------------+----------+----------+--------------------+
> 3 rows selected (2.523 seconds)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (DRILL-5375) Nested loop join: return correct result for left join

Reply via email to