[
https://issues.apache.org/jira/browse/DRILL-5375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936640#comment-15936640
]
Arina Ielchiieva commented on DRILL-5375:
-----------------------------------------
General implementation approach:
Nested loop join will handle join conditions similar way as filter operator
does. Condition will be converted to
[LogicalExpression|https://github.com/apache/drill/blob/5a4ad2a88331dfe7561ee76fc87e882afe170681/exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillFilterRelBase.java#L78],
only taking several inputs instead of one. Then this converted condition will
be passed to NestedLoopJoinBatch where it will be converted as function inside
of generated doEval method (similar as done in
[FilterRecordBatch|https://github.com/apache/drill/blob/ee399317a1faa44e18aedcb11cfa5d4d5c0941aa/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/filter/FilterRecordBatch.java#L183]).
This will allow to process nested loop join with all types of conditions
(equi, non-equi, mixed conditions and even self-joins). One of the challenges
here is to solve from which input field should be taken. This is solved by
adding for each field during materialization indication from which input it
came from:
[ValueVectorReadExpression|https://github.com/apache/drill/blob/5a4ad2a88331dfe7561ee76fc87e882afe170681/exec/java-exec/src/main/java/org/apache/drill/exec/expr/ValueVectorReadExpression.java]
will contain holder with batch name, row index and batch index in case we have
ExpandableHyperContainer (as we do for right input in nested loop join). During
code generation this batch naming will be used instead of [provided by mapping
set|https://github.com/apache/drill/blob/46b424cbd7747685052ad512f24e38a94ac61202/exec/java-exec/src/main/java/org/apache/drill/exec/expr/EvaluationVisitor.java#L400].
This is just extension to existing code, if batch naming is not provided,
general approach will be used. In
[NestedLoopJoinTemplate|https://github.com/apache/drill/blob/5a4ad2a88331dfe7561ee76fc87e882afe170681/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinTemplate.java]
will be added abstract doEval method which will be called during comparison of
rows from left and right inputs. Also join type will be considered. If LEFT
join takes place, left input fields from left input rows that didn't find
matching right input row (doEval method returns false) will be written to
output batch, right input fields in this case for such row will be null.
> Nested loop join: return correct result for left join
> -----------------------------------------------------
>
> Key: DRILL-5375
> URL: https://issues.apache.org/jira/browse/DRILL-5375
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.8.0
> Reporter: Arina Ielchiieva
> Assignee: Arina Ielchiieva
>
> Mini repro:
> 1. Create 2 Hive tables with data
> {code}
> CREATE TABLE t1 (
> FYQ varchar(999),
> dts varchar(999),
> dte varchar(999)
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> 2016-Q1,2016-06-01,2016-09-30
> 2016-Q2,2016-09-01,2016-12-31
> 2016-Q3,2017-01-01,2017-03-31
> 2016-Q4,2017-04-01,2017-06-30
> CREATE TABLE t2 (
> who varchar(999),
> event varchar(999),
> dt varchar(999)
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> aperson,did somthing,2017-01-06
> aperson,did somthing else,2017-01-12
> aperson,had chrsitmas,2016-12-26
> aperson,went wild,2016-01-01
> {code}
> 2. Impala Query shows correct result
> {code}
> select t2.dt, t1.fyq, t2.who, t2.event
> from t2
> left join t1 on t2.dt between t1.dts and t1.dte
> order by t2.dt;
> +------------+---------+---------+-------------------+
> | dt | fyq | who | event |
> +------------+---------+---------+-------------------+
> | 2016-01-01 | NULL | aperson | went wild |
> | 2016-12-26 | 2016-Q2 | aperson | had chrsitmas |
> | 2017-01-06 | 2016-Q3 | aperson | did somthing |
> | 2017-01-12 | 2016-Q3 | aperson | did somthing else |
> +------------+---------+---------+-------------------+
> {code}
> 3. Drill query shows wrong results:
> {code}
> alter session set planner.enable_nljoin_for_scalar_only=false;
> use hive;
> select t2.dt, t1.fyq, t2.who, t2.event
> from t2
> left join t1 on t2.dt between t1.dts and t1.dte
> order by t2.dt;
> +-------------+----------+----------+--------------------+
> | dt | fyq | who | event |
> +-------------+----------+----------+--------------------+
> | 2016-12-26 | 2016-Q2 | aperson | had chrsitmas |
> | 2017-01-06 | 2016-Q3 | aperson | did somthing |
> | 2017-01-12 | 2016-Q3 | aperson | did somthing else |
> +-------------+----------+----------+--------------------+
> 3 rows selected (2.523 seconds)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)