[jira] [Commented] (DRILL-5375) Nested loop join: return correct result for left join

Arina Ielchiieva (JIRA) Wed, 22 Mar 2017 09:42:24 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936640#comment-15936640
 ]


Arina Ielchiieva commented on DRILL-5375:
-----------------------------------------

General implementation approach:
Nested loop join will handle join conditions similar way as filter operator 
does. Condition will be converted to 
[LogicalExpression|https://github.com/apache/drill/blob/5a4ad2a88331dfe7561ee76fc87e882afe170681/exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillFilterRelBase.java#L78],
 only taking several inputs instead of one. Then this converted condition will 
be passed to NestedLoopJoinBatch where it will be converted as function inside 
of generated doEval method (similar as done in 
[FilterRecordBatch|https://github.com/apache/drill/blob/ee399317a1faa44e18aedcb11cfa5d4d5c0941aa/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/filter/FilterRecordBatch.java#L183]).
 This will allow to process nested loop join with all types of conditions 
(equi, non-equi, mixed conditions and even self-joins). One of the challenges 
here is to solve from which input field should be taken. This is solved by 
adding for each field during materialization indication from which input it 
came from: 
[ValueVectorReadExpression|https://github.com/apache/drill/blob/5a4ad2a88331dfe7561ee76fc87e882afe170681/exec/java-exec/src/main/java/org/apache/drill/exec/expr/ValueVectorReadExpression.java]
 will contain holder with batch name, row index and batch index in case we have 
ExpandableHyperContainer (as we do for right input in nested loop join). During 
code generation this batch naming will be used instead of [provided by mapping 
set|https://github.com/apache/drill/blob/46b424cbd7747685052ad512f24e38a94ac61202/exec/java-exec/src/main/java/org/apache/drill/exec/expr/EvaluationVisitor.java#L400].
 This is just extension to existing code, if batch naming is not provided, 
general approach will be used. In 
[NestedLoopJoinTemplate|https://github.com/apache/drill/blob/5a4ad2a88331dfe7561ee76fc87e882afe170681/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinTemplate.java]
 will be added abstract doEval method which will be called during comparison of 
rows from left and right inputs. Also join type will be considered. If LEFT 
join takes place, left input fields from left input rows that didn't find 
matching right input row (doEval method returns false) will be written to 
output batch, right input fields in this case for such row will be null.

> Nested loop join: return correct result for left join
> -----------------------------------------------------
>
>                 Key: DRILL-5375
>                 URL: https://issues.apache.org/jira/browse/DRILL-5375
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>
> Mini repro:
> 1. Create 2 Hive tables with data
> {code}
> CREATE TABLE t1 (
>   FYQ varchar(999),
>   dts varchar(999),
>   dte varchar(999)
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> 2016-Q1,2016-06-01,2016-09-30
> 2016-Q2,2016-09-01,2016-12-31
> 2016-Q3,2017-01-01,2017-03-31
> 2016-Q4,2017-04-01,2017-06-30
> CREATE TABLE t2 (
>   who varchar(999),
>   event varchar(999),
>   dt varchar(999)
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> aperson,did somthing,2017-01-06
> aperson,did somthing else,2017-01-12
> aperson,had chrsitmas,2016-12-26
> aperson,went wild,2016-01-01
> {code}
> 2. Impala Query shows correct result
> {code}
> select t2.dt, t1.fyq, t2.who, t2.event
> from t2
> left join t1 on t2.dt between t1.dts and t1.dte
> order by t2.dt;
> +------------+---------+---------+-------------------+
> | dt         | fyq     | who     | event             |
> +------------+---------+---------+-------------------+
> | 2016-01-01 | NULL    | aperson | went wild         |
> | 2016-12-26 | 2016-Q2 | aperson | had chrsitmas     |
> | 2017-01-06 | 2016-Q3 | aperson | did somthing      |
> | 2017-01-12 | 2016-Q3 | aperson | did somthing else |
> +------------+---------+---------+-------------------+
> {code}
> 3. Drill query shows wrong results:
> {code}
> alter session set planner.enable_nljoin_for_scalar_only=false;
> use hive;
> select t2.dt, t1.fyq, t2.who, t2.event
> from t2
> left join t1 on t2.dt between t1.dts and t1.dte
> order by t2.dt;
> +-------------+----------+----------+--------------------+
> |     dt      |   fyq    |   who    |       event        |
> +-------------+----------+----------+--------------------+
> | 2016-12-26  | 2016-Q2  | aperson  | had chrsitmas      |
> | 2017-01-06  | 2016-Q3  | aperson  | did somthing       |
> | 2017-01-12  | 2016-Q3  | aperson  | did somthing else  |
> +-------------+----------+----------+--------------------+
> 3 rows selected (2.523 seconds)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (DRILL-5375) Nested loop join: return correct result for left join

Reply via email to