[jira] [Comment Edited] (DRILL-6128) Wrong Result with Nested Loop Join

Sorabh Hamirwasia (JIRA) Wed, 31 Jan 2018 19:37:31 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347960#comment-16347960
 ]


Sorabh Hamirwasia edited comment on DRILL-6128 at 2/1/18 3:36 AM:
------------------------------------------------------------------

I did some more investigation on this and looks like we recently added 
generated code for *doEval* method of NestedLoopJoin as part of DRILL-5375. The 
right sift happens inside *doEval* method, reason being it identifies the right 
side container as HyperContainer and does right shift on the index to get the 
batchIndex. This will only work if it's insured by the creator of Expandable 
HyperContainer to fully pack the value vectors inside it and then just use So 
far other operators using the ExpandableHyperContainers are HashJoin, 
MergingRecordBatch and Sort/TopN (using PriorityQueue).

>From discussion with [~amansinha100] it looks like while building HashTable we 
>use HyperContainers of BatchHolders, but we make sure that each BatchHolder is 
>fully filled before adding another one in the container. Hence it is working 
>fine with respect to generated code accessing records from it. It would be 
>good to make sure PriorityQueue is also doing something like this.

*Current Nested Loop Behavior:*

NestedLoop Join adds the right side input batches inside HyperContainer 
(rightContainer) without ensuring it's fully packed. It also maintains a list 
of record counts in each batch in rightCounts. Later these are passed to 
generated code using BatchReference. During EvaluationVisitor, when it sees the 
rightContainer as HyperContainer it does right shift on passed batchIndex.

*Currently to fix this issue we have following ways:*
 1) Make sure all the operators using Hyper container always fully pack it 
(which looks to be the case today). In which case Nested Loop Join has to do 
something similar like while adding batches inside rightContainer make sure the 
batch is fully packed.

2) Operators which are not fully packing the hyper container should use 
BatchReference for generated code and should also keep track of list of record 
counts in each batch along with hyper container. Whenever only index of the 
record is passed to the generated code it should generate index like: 
rightIndex = batchIndex << 16 + recordWithinBatchIndex. When both batch index 
and record index is passed separately then it should generate batchIndex = 
batchIndex << 16 and pass recordWithintBatchIndex separately. Later is the case 
for NestedLoopJoin current Implementation.


was (Author: shamirwasia):
I did some more investigation on this and looks like we recently added 
generated code for *doEval* method of NestedLoopJoin. The right sift happens 
inside *doEval* method, reason being it identifies the right side container as 
HyperContainer and does right shift on the index to get the batchIndex. This 
will only work if it's insured by the creator of Expandable HyperContainer to 
fully pack the value vectors inside it and then just use So far other operators 
using the ExpandableHyperContainers are HashJoin, MergingRecordBatch and 
Sort/TopN (using PriorityQueue).

>From discussion with [~amansinha100] it looks like while building HashTable we 
>use HyperContainers of BatchHolders, but we make sure that each BatchHolder is 
>fully filled before adding another one in the container. Hence it is working 
>fine with respect to generated code accessing records from it. It would be 
>good to make sure PriorityQueue is also doing something like this.

*Current Nested Loop Behavior:*

NestedLoop Join adds the right side input batches inside HyperContainer 
(rightContainer) without ensuring it's fully packed. It also maintains a list 
of record counts in each batch in rightCounts. Later these are passed to 
generated code using BatchReference. During EvaluationVisitor, when it sees the 
rightContainer as HyperContainer it does right shift on passed batchIndex.


*Currently to fix this issue we have following ways:*
1) Make sure all the operators using Hyper container always fully pack it 
(which looks to be the case today). In which case Nested Loop Join has to do 
something similar like while adding batches inside rightContainer make sure the 
batch is fully packed.

2) Operators which are not fully packing the hyper container should use 
BatchReference for generated code and should also keep track of list of record 
counts in each batch along with hyper container. Whenever only index of the 
record is passed to the generated code it should generate index like: 
rightIndex = batchIndex << 16 + recordWithinBatchIndex. When both batch index 
and record index is passed separately then it should generate batchIndex = 
batchIndex << 16 and pass recordWithintBatchIndex separately. Later is the case 
for NestedLoopJoin current Implementation.

> Wrong Result with Nested Loop Join
> ----------------------------------
>
>                 Key: DRILL-6128
>                 URL: https://issues.apache.org/jira/browse/DRILL-6128
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>            Reporter: Sorabh Hamirwasia
>            Assignee: Sorabh Hamirwasia
>            Priority: Major
>
> Nested Loop Join produces wrong result's if there are multiple batches on the 
> right side. It builds an ExapandableHyperContainer to hold all the right side 
> of batches. Then for each record on left side input evaluates the condition 
> with all records on right side and emit the output if condition is satisfied. 
> The main loop inside 
> [populateOutgoingBatch|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinTemplate.java#L106]
>  call's *doEval* with correct indexes to evaluate records on both the sides. 
> In generated code of *doEval* for some reason there is a right shift of 16 
> done on the rightBatchIndex (sample shared below).
> {code:java}
> public boolean doEval(int leftIndex, int rightBatchIndex, int 
> rightRecordIndexWithinBatch)
>  throws SchemaChangeException
> {
>   {
>    IntHolder out3 = new IntHolder();
>    {
>      out3 .value = vv0 .getAccessor().get((leftIndex));
>    }
>    IntHolder out7 = new IntHolder();
>    {
>      out7 .value =  
>  
> vv4[((rightBatchIndex)>>>16)].getAccessor().get(((rightRecordIndexWithinBatch)&
>  65535));
>    }
> ......
> ......
> }{code}
>  
> When the actual loop is processing second batch, inside eval method the index 
> with right shift becomes 0 and it ends up evaluating condition w.r.t first 
> right batch again. So if there is more than one batch (upto 65535) on right 
> side doEval will always consider first batch for condition evaluation. But 
> the output data will be based on correct batch so there will be issues like 
> OutOfBound and WrongData. Cases can be:
> Let's say: *rightBatchIndex*: index of right batch to consider, 
> *rightRecordIndexWithinBatch*: index of record in right batch at 
> rightBatchIndex
> 1) First right batch comes with zero data and with OK_NEW_SCHEMA (let's say 
> because of filter in the operator tree). Next Right batch has > 0 data. So 
> when we call doEval for second batch(*rightBatchIndex = 1*) and first record 
> in it (i.e. *rightRecordIndexWithinBatch = 0*), actual evaluation will happen 
> using first batch (since *rightBatchIndex >>> 16 = 0*). On accessing record 
> at *rightRecordIndexWithinBatch* in first batch it will throw 
> *IndexOutofBoundException* since the first batch has no records.
> 2) Let's say there are 2 batches on right side. Also let's say first batch 
> contains 3 records (with id_right=1/2/3) and 2nd batch also contain 3 records 
> (with id_right=10/20/30). Also let's say there is 1 batch on left side with 3 
> records (with id_left=1/2/3). Then in this case the NestedLoopJoin (with 
> equality condition) will end up producing 6 records instead of 3. It produces 
> first 3 records based on match between left records and match in first right 
> batch records. But while 2nd right batch it will evaluate id_left=id_right 
> based on first batch instead and will again find matches and will produce 
> another 3 records. *Example:*
> *Left Batch Data:*
>  
> {code:java}
> Batch1:
> {
>  "id_left": 1,
>  "cost_left": 11,
>  "name_left": "item11"
> }
> {
>  "id_left": 2,
>  "cost_left": 21,
>  "name_left": "item21"
> }
> {
>  "id_left": 3,
>  "cost_left": 31,
>  "name_left": "item31"
> }{code}
>  
> *Right Batch Data:*
>  
> {code:java}
> Batch 1:
> {
>  "id_right": 1,
>  "cost_right": 10,
>  "name_right": "item1"
> }
> {
>  "id_right": 2,
>  "cost_right": 20,
>  "name_right": "item2"
> }
> {
>  "id_right": 3,
>  "cost_right": 30,
>  "name_right": "item3"
> }
> {code}
>  
>  
> {code:java}
> Batch 2:
> {
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }
> {
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }
> {
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }{code}
>  
> *Produced output:*
> {code:java}
> {
>  "id_left": 1,
>  "cost_left": 11,
>  "name_left": "item11",
>  "id_right": 1,
>  "cost_right": 10,
>  "name_right": "item1"
> }
> {
>  "id_left": 1,
>  "cost_left": 11,
>  "name_left": "item11",
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }
> {
>  "id_left": 2,
>  "cost_left": 21,
>  "name_left": "item21"
>  "id_right": 2, 
>  "cost_right": 20,
>  "name_right": "item2"
> }
> {
>  "id_left": 2,
>  "cost_left": 21,
>  "name_left": "item21"
>  "id_right": 4, 
>  "cost_right": 40,
>  "name_right": "item4"
> }
> {
>  "id_left": 3,
>  "cost_left": 31,
>  "name_left": "item31"
>  "id_right": 3, 
>  "cost_right": 30,
>  "name_right": "item3"
> }
> {
>  "id_left": 3,
>  "cost_left": 31,
>  "name_left": "item31"
>  "id_right": 4, 
>  "cost_right": 40,
>  "name_right": "item4"
> }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (DRILL-6128) Wrong Result with Nested Loop Join

Reply via email to