[ https://issues.apache.org/jira/browse/DRILL-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sorabh Hamirwasia updated DRILL-6128:
-------------------------------------
Reviewer: Aman Sinha
> Wrong Result with Nested Loop Join
> ----------------------------------
>
> Key: DRILL-6128
> URL: https://issues.apache.org/jira/browse/DRILL-6128
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Relational Operators
> Affects Versions: 1.11.0
> Reporter: Sorabh Hamirwasia
> Assignee: Sorabh Hamirwasia
> Priority: Major
> Labels: ready-to-commit
> Fix For: 1.13.0
>
>
> Nested Loop Join produces wrong results if there are multiple batches on the
> right side. It builds an ExpandableHyperContainer to hold all the right-side
> batches. Then, for each record on the left side, it evaluates the condition
> against all records on the right side and emits the output if the condition
> is satisfied. The main loop inside
> [populateOutgoingBatch|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinTemplate.java#L106]
> calls *doEval* with the correct indexes to evaluate records on both sides.
> In the generated code of *doEval*, for some reason a right shift of 16 is
> applied to rightBatchIndex (sample shared below).
> {code:java}
> public boolean doEval(int leftIndex, int rightBatchIndex, int rightRecordIndexWithinBatch)
>     throws SchemaChangeException
> {
>     {
>         IntHolder out3 = new IntHolder();
>         {
>             out3 .value = vv0 .getAccessor().get((leftIndex));
>         }
>         IntHolder out7 = new IntHolder();
>         {
>             out7 .value = vv4[((rightBatchIndex)>>> 16)].getAccessor().get(((rightRecordIndexWithinBatch)& 65535));
>         }
>         ......
>         ......
> }{code}
>
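The effect of the two bit operations in the generated code above can be checked in isolation. The following is a minimal sketch, not Drill code: the class and method names are hypothetical, and `pack` is only an assumption about what kind of input the `>>> 16` / `& 65535` pair would decode correctly (a single int with the batch in the upper 16 bits and the record in the lower 16 bits).

```java
public class ShiftDemo {
    // Same expression the generated doEval applies to rightBatchIndex.
    static int batchSlot(int rightBatchIndex) {
        return rightBatchIndex >>> 16;
    }

    // Same expression the generated doEval applies to rightRecordIndexWithinBatch.
    static int recordSlot(int rightRecordIndexWithinBatch) {
        return rightRecordIndexWithinBatch & 65535;
    }

    // Hypothetical helper (an assumption, for illustration only): packs a batch
    // index into the upper 16 bits and a record index into the lower 16 bits.
    static int pack(int batch, int record) {
        return (batch << 16) | (record & 65535);
    }

    public static void main(String[] args) {
        // A plain batch index below 65536 always lands on slot 0 after the shift.
        for (int batch : new int[] {0, 1, 2, 65535, 65536}) {
            System.out.println("rightBatchIndex=" + batch + " -> vv4[" + batchSlot(batch) + "]");
        }
        // A packed value, by contrast, decodes back to the intended batch and record.
        int packed = pack(1, 2);
        System.out.println("packed=" + packed + " -> vv4[" + batchSlot(packed)
                + "], record " + recordSlot(packed));
    }
}
```

With plain indexes, as the loop passes them, every batch index from 0 through 65535 selects slot 0, which is the behavior described in this issue.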
> When the loop is processing the second batch, the right-shifted index inside
> the eval method becomes 0, so the condition ends up being evaluated against
> the first right batch again. So if there is more than one batch (up to 65535)
> on the right side, doEval will always use the first batch for condition
> evaluation. But the output data is taken from the correct batch, so there
> will be issues like out-of-bounds errors and wrong data. Cases can be:
> Let's say: *rightBatchIndex*: index of the right batch to consider,
> *rightRecordIndexWithinBatch*: index of the record within the right batch at
> rightBatchIndex
> 1) The first right batch arrives with zero records and with OK_NEW_SCHEMA
> (say, because of a filter in the operator tree). The next right batch has
> > 0 records. So when we call doEval for the second batch
> (*rightBatchIndex = 1*) and the first record in it (i.e.
> *rightRecordIndexWithinBatch = 0*), the actual evaluation happens against the
> first batch (since *rightBatchIndex >>> 16 = 0*). Accessing the record at
> *rightRecordIndexWithinBatch* in the first batch throws
> *IndexOutOfBoundsException*, since the first batch has no records.
> 2) Let's say there are 2 batches on the right side. The first batch contains
> 3 records (with id_right=1/2/3) and the 2nd batch also contains 3 records
> (all with id_right=4). There is 1 batch on the left side with 3 records
> (with id_left=1/2/3). In this case the NestedLoopJoin (with an equality
> condition) ends up producing 6 records instead of 3. It produces the first 3
> records from matches between the left records and the records of the first
> right batch. But while processing the 2nd right batch it evaluates
> id_left=id_right against the first batch instead, finds the same matches
> again, and produces another 3 records. *Example:*
> *Left Batch Data:*
>
> {code:java}
> Batch 1:
> {
> "id_left": 1,
> "cost_left": 11,
> "name_left": "item11"
> }
> {
> "id_left": 2,
> "cost_left": 21,
> "name_left": "item21"
> }
> {
> "id_left": 3,
> "cost_left": 31,
> "name_left": "item31"
> }{code}
>
> *Right Batch Data:*
>
> {code:java}
> Batch 1:
> {
> "id_right": 1,
> "cost_right": 10,
> "name_right": "item1"
> }
> {
> "id_right": 2,
> "cost_right": 20,
> "name_right": "item2"
> }
> {
> "id_right": 3,
> "cost_right": 30,
> "name_right": "item3"
> }
> {code}
>
>
> {code:java}
> Batch 2:
> {
> "id_right": 4,
> "cost_right": 40,
> "name_right": "item4"
> }
> {
> "id_right": 4,
> "cost_right": 40,
> "name_right": "item4"
> }
> {
> "id_right": 4,
> "cost_right": 40,
> "name_right": "item4"
> }{code}
>
> *Produced output:*
> {code:java}
> {
> "id_left": 1,
> "cost_left": 11,
> "name_left": "item11",
> "id_right": 1,
> "cost_right": 10,
> "name_right": "item1"
> }
> {
> "id_left": 1,
> "cost_left": 11,
> "name_left": "item11",
> "id_right": 4,
> "cost_right": 40,
> "name_right": "item4"
> }
> {
> "id_left": 2,
> "cost_left": 21,
> "name_left": "item21",
> "id_right": 2,
> "cost_right": 20,
> "name_right": "item2"
> }
> {
> "id_left": 2,
> "cost_left": 21,
> "name_left": "item21",
> "id_right": 4,
> "cost_right": 40,
> "name_right": "item4"
> }
> {
> "id_left": 3,
> "cost_left": 31,
> "name_left": "item31",
> "id_right": 3,
> "cost_right": 30,
> "name_right": "item3"
> }
> {
> "id_left": 3,
> "cost_left": 31,
> "name_left": "item31",
> "id_right": 4,
> "cost_right": 40,
> "name_right": "item4"
> }{code}
>
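Both cases can be reproduced with a small stand-alone simulation. This is only a sketch under stated assumptions, not the Drill operator: batches are modeled as plain `int` arrays holding just the id column, the class and method names are hypothetical, and the "direct" variant merely illustrates what evaluation against the intended batch would return; it is not presented as the actual fix.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedLoopJoinSketch {
    // Equality check the way the generated doEval indexes the right side:
    // the batch index is shifted right by 16 before it is used.
    static boolean shiftedEval(int[] left, int[][] right, int l, int batch, int rec) {
        return left[l] == right[batch >>> 16][rec & 65535];
    }

    // Illustrative alternative: use the batch index as passed by the loop.
    static boolean directEval(int[] left, int[][] right, int l, int batch, int rec) {
        return left[l] == right[batch][rec];
    }

    // For each left record, scan every record of every right batch.
    static List<String> join(int[] left, int[][] right, boolean shifted) {
        List<String> out = new ArrayList<>();
        for (int l = 0; l < left.length; l++) {
            for (int b = 0; b < right.length; b++) {
                for (int r = 0; r < right[b].length; r++) {
                    boolean match = shifted
                            ? shiftedEval(left, right, l, b, r)
                            : directEval(left, right, l, b, r);
                    if (match) {
                        // Output is copied from the batch the loop is actually on.
                        out.add(left[l] + "=" + right[b][r]);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] left = {1, 2, 3};
        int[][] right = {{1, 2, 3}, {4, 4, 4}};
        System.out.println("shifted: " + join(left, right, true));
        System.out.println("direct:  " + join(left, right, false));

        // Case 1: empty first right batch, one record in the second batch.
        try {
            join(left, new int[][] {{}, {1}}, true);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("case 1: index out of bounds on the empty first batch");
        }
    }
}
```

In this sketch the shifted variant yields six pairs in the same order as the produced output above (`1=1, 1=4, 2=2, 2=4, 3=3, 3=4`), while the direct variant yields only `1=1, 2=2, 3=3`.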