[ 
https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated PIG-570:
------------------------------

    Attachment: PIG-570.patch
                bzipTest.bz2

Fixed the bzip file used by the test cases so that it contains carefully crafted bad corner cases.

> Large BZip files seem to lose data in Pig
> -------------------------------------------
>
>                 Key: PIG-570
>                 URL: https://issues.apache.org/jira/browse/PIG-570
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch, 0.0.0, 0.1.0, site
>         Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
>            Reporter: Alex Newman
>             Fix For: types_branch, 0.0.0, 0.1.0, site
>
>         Attachments: bzipTest.bz2, PIG-570.patch
>
>
> I don't believe bzip2 input to Pig is working, at least not with large 
> files. It seems as though map files are getting cut off. The maps complete 
> way too quickly, and the row of data that Pig tries to process is often 
> cut at a random point and left incomplete. Here are my symptoms:
> - Maps seem to be completing at an unbelievably fast rate
> With uncompressed data
> Status: Succeeded
> Started at: Wed Dec 17 21:31:10 EST 2008
> Finished at: Wed Dec 17 22:42:09 EST 2008
> Finished in: 1hrs, 10mins, 59sec
> map   100.00%
> 4670  0       0       4670    0       0 / 21
> reduce        57.72%
> 13    0       0       13      0       0 / 4
> With bzip compressed data
> Started at: Wed Dec 17 21:17:28 EST 2008
> Failed at: Wed Dec 17 21:17:52 EST 2008
> Failed in: 24sec
> Black-listed TaskTrackers: 2
> Kind  % Complete      Num Tasks       Pending Running Complete        Killed  Failed/Killed Task Attempts
> map   100.00%
> 183   0       0       15      168     54 / 22
> reduce        100.00%
> 13    0       0       0       13      0 / 0
> The errors we get:
> java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec        A, 0HAW, CHIX, )
>       at org.apache.pig.data.Tuple.getField(Tuple.java:176)
>       at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
>       at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
>       at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
>       at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
>       at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
>       at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>       at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
> Last 4KB
> attempt_200812161759_0045_m_000007_0  task_200812161759_0045_m_000007 tsdhb06.factset.com     FAILED  
> java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec       A, CSGN, VTX, VTX, 0, 20080303, 90919, 380, 1543, 206002)
>       at org.apache.pig.data.Tuple.getField(Tuple.java:176)
>       at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
>       at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
>       at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
>       at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
>       at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
>       at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>       at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
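
One way to check whether the short tuples in the traces above come from the data itself or from how the compressed input is read is to decompress the file outside of Pig and count rows with too few fields. The following is only a rough sketch and is not part of the attached patch: the function name count_short_rows is hypothetical, the tab delimiter is an assumption about the input, and the expected field count of 12 is inferred from the "Requested index 11" message.

```python
import bz2


def count_short_rows(path, expected_fields, delimiter="\t"):
    """Return (total_rows, short_rows) for a bzip2-compressed text file.

    A row is "short" when it has fewer than expected_fields fields.
    The tab delimiter is an assumption; adjust it to match the data.
    """
    total = 0
    short = 0
    # bz2.open decompresses the whole file sequentially, so no row can be
    # cut at a split boundary the way it might be inside a map task.
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            total += 1
            fields = line.rstrip("\n").split(delimiter)
            if len(fields) < expected_fields:
                short += 1
    return total, short
```

For example, count_short_rows("input.bz2", expected_fields=12), where input.bz2 is a placeholder file name, returns the total row count and the number of short rows. If the short count is zero while the Pig job still fails with IndexOutOfBoundsException, that would suggest the rows are being truncated while the compressed input is read rather than in the source data.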

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
