[ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Reed updated PIG-570:
------------------------------
Attachment: PIG-570.patch
bzipTest.bz2
Fixed the bzip file for the test cases so that it contains carefully crafted bad corner cases.
> Large BZip files seem to lose data in Pig
> -----------------------------------------
>
> Key: PIG-570
> URL: https://issues.apache.org/jira/browse/PIG-570
> Project: Pig
> Issue Type: Bug
> Affects Versions: types_branch, 0.0.0, 0.1.0, site
> Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
> Reporter: Alex Newman
> Fix For: types_branch, 0.0.0, 0.1.0, site
>
> Attachments: bzipTest.bz2, PIG-570.patch
>
>
> I don't believe bzip2 input to Pig is working, at least not with large
> files. It seems as though map files are getting cut off: the maps complete
> way too quickly, and the row of data that Pig tries to process is often
> cut at a random point and left incomplete. Here are my symptoms:
> - Maps seem to be completing at an unbelievably fast rate
> With uncompressed data
> Status: Succeeded
> Started at: Wed Dec 17 21:31:10 EST 2008
> Finished at: Wed Dec 17 22:42:09 EST 2008
> Finished in: 1hrs, 10mins, 59sec
> Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
> map     100.00%     4670       0        0        4670      0       0 / 21
> reduce  57.72%      13         0        0        13        0       0 / 4
> With bzip compressed data
> Started at: Wed Dec 17 21:17:28 EST 2008
> Failed at: Wed Dec 17 21:17:52 EST 2008
> Failed in: 24sec
> Black-listed TaskTrackers: 2
> Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
> map     100.00%     183        0        0        15        168     54 / 22
> reduce  100.00%     13         0        0        0         13      0 / 0
> The errors we get:
> java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec A, 0HAW, CHIX, )
> at org.apache.pig.data.Tuple.getField(Tuple.java:176)
> at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
> at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
> at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
> at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
> at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
> at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
> attempt_200812161759_0045_m_000007_0 task_200812161759_0045_m_000007 tsdhb06.factset.com FAILED
> java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec A, CSGN, VTX, VTX, 0, 20080303, 90919, 380, 1543, 206002)
> at org.apache.pig.data.Tuple.getField(Tuple.java:176)
> at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
> at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
> at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
> at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
> at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
> at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
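
Both traces fail on index 11, so the script expects at least 12 fields per row, yet the tuples shown carry only 4 and 10. One way to tell whether such short rows exist in the .bz2 file itself, or only appear once Pig reads it, is to scan the file outside Pig. Below is a minimal Python sketch of that check. It assumes tab-delimited rows and a 12-field minimum; the delimiter is a guess, the helper names are hypothetical, and the sample file is generated rather than taken from the reporter's data.

```python
import bz2
import os
import tempfile


def find_short_rows(path, min_fields, delimiter="\t"):
    """Return (line_number, field_count) for each row with fewer than min_fields fields."""
    short = []
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split(delimiter)
            if len(fields) < min_fields:
                short.append((line_number, len(fields)))
    return short


def write_sample(path, rows, fields_per_row, delimiter="\t"):
    """Write a synthetic bzip2 file; compresslevel=1 selects the smallest (100k) block size."""
    with bz2.open(path, "wt", encoding="utf-8", compresslevel=1) as handle:
        for row in range(rows):
            cells = (f"r{row}f{col}" for col in range(fields_per_row))
            handle.write(delimiter.join(cells) + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.bz2")
        # About 2 MB of text, so the sample should span several compressed blocks.
        write_sample(sample, rows=20000, fields_per_row=12)
        print(find_short_rows(sample, min_fields=12))
```

If a scan like this reports no short rows for the real input while the Pig job still fails, the truncation is happening in how the compressed file is read rather than in the data.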
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.