[ 
https://issues.apache.org/jira/browse/PIG-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499345#comment-13499345
 ] 

Koji Noguchi commented on PIG-3051:
-----------------------------------

bq. I couldn't reproduce this with short pig code (due to ColumnPruning somehow 
not happening when shortened), 

I learned that column pruning does not kick in unless there is a column or 
map key to prune in the load itself (even though ColumnPruning does more than 
just prune at the load).

By adding one extra line to force column pruning, I was able to reproduce this 
issue.  The first example hits IndexOutOfBoundsException and the second one 
produces an incorrect result.

{noformat}
% cat test/pig-3051-1.pig 
A = load 'a.txt' using PigStorage() as (a1:chararray, a2:chararray, 
a3:chararray, a4:chararray);
B = foreach A generate a2,a3,a4;  --to force columnprune algo to cover
G = order B by a4;
U1 = limit G 3;
U2 = foreach U1 generate a4;
store G into 'g' using PigStorage();
store U2 into 'u2' using PigStorage(); 
% cat a.txt
1       2       3       4
2       3       4       1
3       4       1       2
4       1       2       3
% pig -x local test/pig-3051-1.pig 
...
fails with Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
{noformat}

Now, with 2 extra columns added, the job finishes but the result is incorrect.

{noformat}
% cat test/pig-3051-2.pig 
A = load 'b.txt' using PigStorage() as (a1:chararray, a2:chararray, 
a3:chararray, a4:chararray, a5:chararray, a6:chararray);
B = foreach A generate a2,a3,a4,a5,a6;  --to force columnprune algo to cover
G = order B by a4;
U1 = limit G 4;
U2 = foreach U1 generate a4,a5,a6;
store G into 'g' using PigStorage();
store U2 into 'u2' using PigStorage(); 
% cat b.txt 
1       2       3       4       5       6
2       3       4       5       6       1
3       4       5       6       1       2
4       5       6       1       2       3
5       6       1       2       3       4
6       1       2       3       4       5
% pig -x local test/pig-3051-2.pig 
...
success
% cat u2/part-r-00000 
5       6       1
6       1       2
1       2       3
2       3       4
{noformat}
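
For reference, here is a minimal Python sketch (not Pig code) that replays the 
data flow of pig-3051-2.pig on the b.txt rows above. It only models the script 
semantics (project, order, limit, project); the helper names `project` and 
`run_script` are made up for illustration and say nothing about Pig 
internals. It shows the output one would expect, and also that the rows 
actually written to u2 coincide with columns a2,a3,a4 of the correctly ordered 
rows, i.e. the ordering looks right while the final projection appears to read 
the wrong positions. That reading is an inference from the data only.

```python
# Rows of b.txt from the example above, columns a1..a6.
ROWS = [
    ["1", "2", "3", "4", "5", "6"],
    ["2", "3", "4", "5", "6", "1"],
    ["3", "4", "5", "6", "1", "2"],
    ["4", "5", "6", "1", "2", "3"],
    ["5", "6", "1", "2", "3", "4"],
    ["6", "1", "2", "3", "4", "5"],
]

# Contents of u2/part-r-00000 as reported for pig-3051-2.pig.
OBSERVED = [
    ["5", "6", "1"],
    ["6", "1", "2"],
    ["1", "2", "3"],
    ["2", "3", "4"],
]

SCHEMA_A = ["a1", "a2", "a3", "a4", "a5", "a6"]
SCHEMA_B = ["a2", "a3", "a4", "a5", "a6"]


def project(row, names, schema):
    """Pick the named columns out of a row laid out according to schema."""
    return [row[schema.index(name)] for name in names]


def run_script():
    """Replay B, G, U1, U2 of pig-3051-2.pig (hypothetical helper)."""
    b = [project(r, SCHEMA_B, SCHEMA_A) for r in ROWS]       # B = foreach A ...
    g = sorted(b, key=lambda r: r[SCHEMA_B.index("a4")])     # G = order B by a4
    u1 = g[:4]                                               # U1 = limit G 4
    expected = [project(r, ["a4", "a5", "a6"], SCHEMA_B) for r in u1]
    # Same rows, but reading the first three positions (a2,a3,a4) instead.
    shifted = [r[:3] for r in u1]
    return expected, shifted


if __name__ == "__main__":
    expected, shifted = run_script()
    print("expected U2:", expected)
    print("matches observed when reading a2,a3,a4:", shifted == OBSERVED)
```

The expected rows are the same as the output of the third script below, where 
'store G' is removed.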

And last, taking out 'store G' (which takes out LOSplit) produces the correct 
output.
{noformat}
% cat test/pig-3051-3.pig 
A = load 'b.txt' using PigStorage() as (a1:chararray, a2:chararray, 
a3:chararray, a4:chararray, a5:chararray, a6:chararray);
B = foreach A generate a2,a3,a4,a5,a6;  --to force columnprune algo to cover
G = order B by a4;
U1 = limit G 4;
U2 = foreach U1 generate a4,a5,a6;
--store G into 'g' using PigStorage();
store U2 into 'u2' using PigStorage(); 

% pig -x local test/pig-3051-3.pig 
... Success.

% cat u2/part-r-00000 
1       2       3
2       3       4
3       4       5
4       5       6
% 
{noformat}

Also tested the patch (pig-3051-v1-withouttest.txt), and it does fix the 
incorrect-result case.
                
> java.lang.IndexOutOfBoundsException  failure with LimitOptimizer + 
> ColumnPruning
> --------------------------------------------------------------------------------
>
>                 Key: PIG-3051
>                 URL: https://issues.apache.org/jira/browse/PIG-3051
>             Project: Pig
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10.0, 0.11
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>         Attachments: pig-3051-v1-withouttest.txt
>
>
> Had a user hitting 
> "Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1" error 
> when he had multiple stores and limit in his code.
> I couldn't reproduce this with short pig code (due to ColumnPruning somehow 
> not happening when shortened), but here's a snippet. 
> {noformat}
> ...
> G3 = FOREACH G2 GENERATE sortCol, FLATTEN(group) as label, (long)COUNT(G1) as 
> cnt;
> G4 = ORDER G3 BY cnt DESC PARALLEL 25;
> ONEROW = LIMIT G4 1;
> U1 = FOREACH ONEROW GENERATE 3 as sortcol, 'somelabel' as label, cnt;
> store U1 into 'u1' using PigStorage();
> store G4 into 'g4' using PigStorage();
> {noformat}
> With '-t ColumnMapKeyPrune', job didn't hit the error.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
