[ 
https://issues.apache.org/jira/browse/PIG-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787234#action_12787234
 ] 

Pradeep Kamath commented on PIG-1131:
-------------------------------------

Not sure, but looking through the code, the issue might be present even if 
first input tuple has more fields than subsequent input tuples for either of 
the inputs in the join. The issue is because in the optimization to not send 
parts of the value present in the key, in POLocalRearrange, we further try to 
optimize by remembering which parts of the value were present in the key for 
the first value - so if the next value has different number of fields, we hit 
the exception seen in the description.

> Pig simple join does not work when it contains empty lines
> ----------------------------------------------------------
>
>                 Key: PIG-1131
>                 URL: https://issues.apache.org/jira/browse/PIG-1131
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Viraj Bhat
>            Priority: Critical
>             Fix For: 0.7.0
>
>         Attachments: junk1.txt, junk2.txt, simplejoinscript.pig
>
>
> I have a simple script, which does a JOIN.
> {code}
> input1 = load '/user/viraj/junk1.txt' using PigStorage(' ');
> describe input1;
> input2 = load '/user/viraj/junk2.txt' using PigStorage('\u0001');
> describe input2;
> joineddata = JOIN input1 by $0, input2 by $0;
> describe joineddata;
> store joineddata into 'result';
> {code}
> The input data contains empty lines.  
> The join fails in the Map phase with the following error in the 
> PRLocalRearrange.java
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>       at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>       at java.util.ArrayList.get(ArrayList.java:322)
>       at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.constructLROutput(POLocalRearrange.java:464)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:360)
>       at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
>       at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>       at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>       at org.apache.hadoop.mapred.Child.main(Child.java:159)
> I am surprised that the test cases did not detect this error. Could we add 
> this data which contains empty lines to the testcases?
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to