[ https://issues.apache.org/jira/browse/DRILL-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366590#comment-16366590 ]
ASF GitHub Bot commented on DRILL-6128:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/1109

> Wrong Result with Nested Loop Join
> ----------------------------------
>
>                 Key: DRILL-6128
>                 URL: https://issues.apache.org/jira/browse/DRILL-6128
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.11.0
>            Reporter: Sorabh Hamirwasia
>            Assignee: Sorabh Hamirwasia
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.13.0
>
>
> Nested Loop Join produces wrong results if there are multiple batches on the
> right side. It builds an ExpandableHyperContainer to hold all the right-side
> batches. Then, for each record on the left side, it evaluates the condition
> against all records on the right side and emits output if the condition is
> satisfied.
> The main loop inside
> [populateOutgoingBatch|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinTemplate.java#L106]
> calls *doEval* with the correct indexes to evaluate records on both sides.
> In the generated code of *doEval*, for some reason the rightBatchIndex is
> right-shifted by 16 (sample shared below).
> {code:java}
> public boolean doEval(int leftIndex, int rightBatchIndex, int rightRecordIndexWithinBatch)
>     throws SchemaChangeException
> {
>     {
>         IntHolder out3 = new IntHolder();
>         {
>             out3 .value = vv0 .getAccessor().get((leftIndex));
>         }
>         IntHolder out7 = new IntHolder();
>         {
>             out7 .value = vv4[((rightBatchIndex)>>>16)].getAccessor().get(((rightRecordIndexWithinBatch)& 65535));
>         }
>         ......
>         ......
> }{code}
>
> When the actual loop is processing the second batch, inside the eval method
> the right-shifted index becomes 0, so it ends up evaluating the condition
> against the first right batch again.
> So if there is more than one batch (up to 65535) on the right side, doEval
> will always use the first batch for condition evaluation. But the output
> data will be taken from the correct batch, so there will be issues like
> out-of-bounds access and wrong data. The cases can be:
> Let's say: *rightBatchIndex*: index of the right batch to consider,
> *rightRecordIndexWithinBatch*: index of the record in the right batch at
> rightBatchIndex
> 1) The first right batch comes with zero records and with OK_NEW_SCHEMA
> (let's say because of a filter in the operator tree). The next right batch
> has more than zero records. So when we call doEval for the second batch
> (*rightBatchIndex = 1*) and the first record in it (i.e.
> *rightRecordIndexWithinBatch = 0*), the actual evaluation will happen using
> the first batch (since *rightBatchIndex >>> 16 = 0*). Accessing the record
> at *rightRecordIndexWithinBatch* in the first batch will throw
> *IndexOutOfBoundsException*, since the first batch has no records.
> 2) Let's say there are 2 batches on the right side. Also let's say the first
> batch contains 3 records (with id_right=1/2/3) and the 2nd batch also
> contains 3 records (with id_right=10/20/30). Also let's say there is 1 batch
> on the left side with 3 records (with id_left=1/2/3). In this case the
> NestedLoopJoin (with an equality condition) will end up producing 6 records
> instead of 3. It produces the first 3 records from the matches between the
> left records and the first right batch. But while processing the 2nd right
> batch it will evaluate id_left=id_right against the first batch instead, find
> the same matches again, and produce another 3 records.
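
The two cases above can be reproduced outside Drill with a small stand-alone sketch. Everything in it is hypothetical: the class name `NestedLoopIndexSketch`, the helpers `buggyRead`, `directRead` and `join`, and the use of plain `int[][]` arrays in place of Drill's value vectors are illustration only, and the loop order is simplified. The only part taken from the report is the index arithmetic, `rightBatchIndex >>> 16` and `rightRecordIndexWithinBatch & 65535`. The shift and mask look like the decoding of a composite index (batch in the upper 16 bits, record in the lower 16); that reading is an assumption, not something the report states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone model of the indexing problem; not Drill code.
class NestedLoopIndexSketch {

  // Mirrors the arithmetic in the generated doEval: the batch index is
  // shifted right by 16, so any index below 65536 selects batch 0.
  static int buggyRead(int[][] rightBatches, int rightBatchIndex,
                       int rightRecordIndexWithinBatch) {
    return rightBatches[rightBatchIndex >>> 16][rightRecordIndexWithinBatch & 65535];
  }

  // Uses the batch index exactly as the caller passed it.
  static int directRead(int[][] rightBatches, int rightBatchIndex,
                        int rightRecordIndexWithinBatch) {
    return rightBatches[rightBatchIndex][rightRecordIndexWithinBatch];
  }

  // Equality join on ids. The condition is evaluated through buggyRead or
  // directRead, but the emitted row is always copied from the batch the
  // loop is currently on, as the report describes.
  static List<int[]> join(int[] left, int[][] right, boolean buggy) {
    List<int[]> out = new ArrayList<>();
    for (int l = 0; l < left.length; l++) {
      for (int b = 0; b < right.length; b++) {
        for (int r = 0; r < right[b].length; r++) {
          int evaluated = buggy ? buggyRead(right, b, r) : directRead(right, b, r);
          if (left[l] == evaluated) {
            out.add(new int[] {left[l], right[b][r]});
          }
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Case 2: two right batches of three records each, one left batch.
    int[] left = {1, 2, 3};
    int[][] right = {{1, 2, 3}, {10, 20, 30}};
    System.out.println(join(left, right, true).size());   // prints 6
    System.out.println(join(left, right, false).size());  // prints 3

    // Case 1: empty first batch, so the shifted read goes out of bounds.
    int[][] emptyFirst = {{}, {5}};
    try {
      buggyRead(emptyFirst, 1, 0);
    } catch (IndexOutOfBoundsException e) {
      System.out.println("out of bounds on empty first batch");
    }
  }
}
```

In this model the extra three rows pair each left id with a record from the second batch that never satisfied the condition, which is the same shape as the wrong output in the example.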
> *Example:*
> *Left Batch Data:*
>
> {code:java}
> Batch1:
> {
>  "id_left": 1,
>  "cost_left": 11,
>  "name_left": "item11"
> }
> {
>  "id_left": 2,
>  "cost_left": 21,
>  "name_left": "item21"
> }
> {
>  "id_left": 3,
>  "cost_left": 31,
>  "name_left": "item31"
> }{code}
>
> *Right Batch Data:*
>
> {code:java}
> Batch 1:
> {
>  "id_right": 1,
>  "cost_right": 10,
>  "name_right": "item1"
> }
> {
>  "id_right": 2,
>  "cost_right": 20,
>  "name_right": "item2"
> }
> {
>  "id_right": 3,
>  "cost_right": 30,
>  "name_right": "item3"
> }
> {code}
>
>
> {code:java}
> Batch 2:
> {
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }
> {
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }
> {
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }{code}
>
> *Produced output:*
> {code:java}
> {
>  "id_left": 1,
>  "cost_left": 11,
>  "name_left": "item11",
>  "id_right": 1,
>  "cost_right": 10,
>  "name_right": "item1"
> }
> {
>  "id_left": 1,
>  "cost_left": 11,
>  "name_left": "item11",
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }
> {
>  "id_left": 2,
>  "cost_left": 21,
>  "name_left": "item21",
>  "id_right": 2,
>  "cost_right": 20,
>  "name_right": "item2"
> }
> {
>  "id_left": 2,
>  "cost_left": 21,
>  "name_left": "item21",
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }
> {
>  "id_left": 3,
>  "cost_left": 31,
>  "name_left": "item31",
>  "id_right": 3,
>  "cost_right": 30,
>  "name_right": "item3"
> }
> {
>  "id_left": 3,
>  "cost_left": 31,
>  "name_left": "item31",
>  "id_right": 4,
>  "cost_right": 40,
>  "name_right": "item4"
> }{code}
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)