[ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Reed updated PIG-570: ------------------------------ Status: Patch Available (was: Open) > Large BZip files Seem to loose data in Pig > ------------------------------------------- > > Key: PIG-570 > URL: https://issues.apache.org/jira/browse/PIG-570 > Project: Pig > Issue Type: Bug > Affects Versions: 0.0.0, types_branch, 0.1.0, site > Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2 > Reporter: Alex Newman > Fix For: types_branch, 0.1.0, site, 0.0.0 > > Attachments: bzipTest.bz2, PIG-570.patch > > > So I don't believe bzip2 input to pig is working, at least not with large > files. It seems as though map files are getting cut off. The maps complete > way too quickly and the actual row of data that pig tries to process often > randomly gets cut, and becomes incomplete. Here are my symptoms: > - Maps seem to be completing in a unbelievably fast rate > With uncompressed data > Status: Succeeded > Started at: Wed Dec 17 21:31:10 EST 2008 > Finished at: Wed Dec 17 22:42:09 EST 2008 > Finished in: 1hrs, 10mins, 59sec > map 100.00% > 4670 0 0 4670 0 0 / 21 > reduce 57.72% > 13 0 0 13 0 0 / 4 > With bzip compressed data > Started at: Wed Dec 17 21:17:28 EST 2008 > Failed at: Wed Dec 17 21:17:52 EST 2008 > Failed in: 24sec > Black-listed TaskTrackers: 2 > Kind % Complete Num Tasks Pending Running Complete Killed > Failed/Killed > Task Attempts > map 100.00% > 183 0 0 15 168 54 / 22 > reduce 100.00% > 13 0 0 0 13 0 / 0 > The errors we get: > ava.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec > A, 0HAW, CHIX, ) > at org.apache.pig.data.Tuple.getField(Tuple.java:176) > at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84) > at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38) > at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223) > at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58) > at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60) > at > org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) > at > org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207) > Last 4KB > attempt_200812161759_0045_m_000007_0 task_200812161759_0045_m_000007 > tsdhb06.factset.com FAILED > java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec > A, CSGN, VTX, VTX, 0, 20080303, 90919, 380, 1543, 206002) > at org.apache.pig.data.Tuple.getField(Tuple.java:176) > at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84) > at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38) > at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223) > at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58) > at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60) > at > org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) > at > org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.