[ 
https://issues.apache.org/jira/browse/DRILL-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sorabh Hamirwasia updated DRILL-6128:
-------------------------------------
    Reviewer: Aman Sinha

> Wrong Result with Nested Loop Join
> ----------------------------------
>
>                 Key: DRILL-6128
>                 URL: https://issues.apache.org/jira/browse/DRILL-6128
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.11.0
>            Reporter: Sorabh Hamirwasia
>            Assignee: Sorabh Hamirwasia
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.13.0
>
>
> Nested Loop Join produces wrong results if there are multiple batches on the 
> right side. It builds an ExpandableHyperContainer to hold all the right-side 
> batches. Then, for each record on the left side, it evaluates the condition 
> against all records on the right side and emits output if the condition is 
> satisfied. The main loop inside 
> [populateOutgoingBatch|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinTemplate.java#L106]
>  calls *doEval* with the correct indexes to evaluate records on both sides. 
> In the generated code of *doEval*, for some reason a right shift of 16 is 
> applied to rightBatchIndex (sample shared below).
> {code:java}
> public boolean doEval(int leftIndex, int rightBatchIndex, int rightRecordIndexWithinBatch)
>  throws SchemaChangeException
> {
>   {
>    IntHolder out3 = new IntHolder();
>    {
>      out3 .value = vv0 .getAccessor().get((leftIndex));
>    }
>    IntHolder out7 = new IntHolder();
>    {
>      out7 .value = vv4[((rightBatchIndex)>>>16)].getAccessor().get(((rightRecordIndexWithinBatch)& 65535));
>    }
> ......
> ......
> }{code}
>  
> When the loop is processing the second batch, the right-shifted index inside 
> the eval method becomes 0, so the condition is evaluated against the first 
> right batch again. So if there is more than one batch (up to 65535) on the 
> right side, doEval will always use the first batch for condition evaluation. 
> But the output data is taken from the correct batch, so this leads to issues 
> such as out-of-bounds errors and wrong data. Cases can be:
> Let's say: *rightBatchIndex*: index of the right batch to consider, 
> *rightRecordIndexWithinBatch*: index of the record in the right batch at 
> rightBatchIndex
> 1) The first right batch arrives with zero records and OK_NEW_SCHEMA (let's 
> say because of a filter in the operator tree). The next right batch has > 0 
> records. So when we call doEval for the second batch (*rightBatchIndex = 1*) 
> and the first record in it (i.e. *rightRecordIndexWithinBatch = 0*), the 
> actual evaluation happens against the first batch (since 
> *rightBatchIndex >>> 16 = 0*). Accessing the record at 
> *rightRecordIndexWithinBatch* in the first batch throws 
> *IndexOutOfBoundsException* since the first batch has no records.
> 2) Let's say there are 2 batches on the right side. The first batch contains 
> 3 records (with id_right=1/2/3) and the 2nd batch also contains 3 records 
> (with id_right=10/20/30). There is 1 batch on the left side with 3 records 
> (with id_left=1/2/3). In this case the NestedLoopJoin (with an equality 
> condition) ends up producing 6 records instead of 3. It produces the first 3 
> records from matches between the left records and the records of the first 
> right batch. But while processing the 2nd right batch, it evaluates 
> id_left=id_right against the first batch instead, finds matches again, and 
> produces another 3 records. *Example:*
> *Left Batch Data:*
>  
> {code:java}
> Batch 1:
> {
>  "id_left": 1,
>  "cost_left": 11,
>  "name_left": "item11"
> }
> {
>  "id_left": 2,
>  "cost_left": 21,
>  "name_left": "item21"
> }
> {
>  "id_left": 3,
>  "cost_left": 31,
>  "name_left": "item31"
> }{code}
>  
> *Right Batch Data:*
>  
> {code:java}
> Batch 1:
> {
>  "id_right": 1,
>  "cost_right": 10,
>  "name_right": "item1"
> }
> {
>  "id_right": 2,
>  "cost_right": 20,
>  "name_right": "item2"
> }
> {
>  "id_right": 3,
>  "cost_right": 30,
>  "name_right": "item3"
> }
> {code}
>  
>  
> {code:java}
> Batch 2:
> {
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }
> {
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }
> {
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }{code}
>  
> *Produced output:*
> {code:java}
> {
>  "id_left": 1,
>  "cost_left": 11,
>  "name_left": "item11",
>  "id_right": 1,
>  "cost_right": 10,
>  "name_right": "item1"
> }
> {
>  "id_left": 1,
>  "cost_left": 11,
>  "name_left": "item11",
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }
> {
>  "id_left": 2,
>  "cost_left": 21,
>  "name_left": "item21",
>  "id_right": 2,
>  "cost_right": 20,
>  "name_right": "item2"
> }
> {
>  "id_left": 2,
>  "cost_left": 21,
>  "name_left": "item21",
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }
> {
>  "id_left": 3,
>  "cost_left": 31,
>  "name_left": "item31",
>  "id_right": 3,
>  "cost_right": 30,
>  "name_right": "item3"
> }
> {
>  "id_left": 3,
>  "cost_left": 31,
>  "name_left": "item31",
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }{code}
>  
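
To make the two cases above concrete, here is a minimal Java sketch. It is not Drill code: plain int arrays stand in for the id columns, and the class NestedLoopIndexDemo with its methods generatedRead, directRead and join are hypothetical names invented for this illustration. generatedRead applies the same ">>> 16" and "& 65535" decode as the generated doEval sample, while the emitted pair is always copied from the batch the loop is actually on, which is the mismatch described in the report.

```java
import java.util.ArrayList;
import java.util.List;

class NestedLoopIndexDemo {

    // Same decode as the generated doEval sample: the batch comes from the
    // bits above 16, the record from the low 16 bits. A plain batch index
    // such as 1 therefore decodes to batch 0.
    static int generatedRead(int[][] rightBatches, int rightBatchIndex,
                             int rightRecordIndexWithinBatch) {
        return rightBatches[rightBatchIndex >>> 16][rightRecordIndexWithinBatch & 65535];
    }

    // Hypothetical reference read that uses both indexes as passed in.
    static int directRead(int[][] rightBatches, int rightBatchIndex,
                          int rightRecordIndexWithinBatch) {
        return rightBatches[rightBatchIndex][rightRecordIndexWithinBatch];
    }

    // Equality nested loop join over id columns. The condition is evaluated
    // through one of the two reads above; the output pair always uses the
    // record the loop is positioned on.
    static List<String> join(int[] leftIds, int[][] rightBatches, boolean useGeneratedDecode) {
        List<String> out = new ArrayList<>();
        for (int leftIndex = 0; leftIndex < leftIds.length; leftIndex++) {
            for (int batch = 0; batch < rightBatches.length; batch++) {
                for (int rec = 0; rec < rightBatches[batch].length; rec++) {
                    int evaluated = useGeneratedDecode
                            ? generatedRead(rightBatches, batch, rec)
                            : directRead(rightBatches, batch, rec);
                    if (leftIds[leftIndex] == evaluated) {
                        out.add(leftIds[leftIndex] + "-" + rightBatches[batch][rec]);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] left = {1, 2, 3};
        int[][] right = {{1, 2, 3}, {4, 4, 4}}; // ids from the example batches
        System.out.println("generated-style decode: " + join(left, right, true));
        System.out.println("direct read:            " + join(left, right, false));
    }
}
```

With the example ids, the generated-style decode yields six pairs (each left id paired once with its match in batch 1 and once with a 4 from batch 2), and the direct read yields three. If the first right batch is empty, as in case 1, the generated-style decode reads index 0 of the empty array and this sketch fails with an ArrayIndexOutOfBoundsException.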



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
