[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig

Benjamin Reed (JIRA) Tue, 30 Dec 2008 00:23:14 -0800

     [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Benjamin Reed updated PIG-570:
------------------------------

    Attachment: PIG-570.patch

I believe the problem is due to bad position tracking. In the current version 
of the code, we chop up the input into blocks, but unfortunately when using 
bzip there are bzip block boundaries, HDFS block boundaries, and record 
boundaries. If a bzip block boundary lines up too closely with an HDFS block 
boundary, a record could get skipped or possibly corrupted.

I was able to reproduce a problem with the attached test case; hopefully it is 
the same as your problem.

The root cause turns out to be improper tracking of "position". If we blindly 
use the position of the underlying stream, and a bzip block and an HDFS block 
line up, we may think that we have read the first record of the next slice when 
in fact we have only read the bzip block header.

The attached patch fixes the problem by defining the position of the stream as 
the position of the start of the current block header in the underlying stream.
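
To make the failure mode concrete, here is a minimal sketch in Java. It is a 
toy model, not the code in PIG-570.patch: the class name, the blocksOwned 
method, the 6-byte HEADER_LEN, and all offsets are assumptions made up for 
illustration. It models a slice reader that starts at the first bzip block 
header at or after its slice start and stops as soon as the reported position 
reaches the slice end.

```java
import java.util.ArrayList;
import java.util.List;

class BzipPositionSketch {
    // Hypothetical size of a bzip block header in bytes (an assumption for this model).
    static final int HEADER_LEN = 6;

    // Returns the indices of the bzip blocks that a reader for slice [start, end)
    // would claim. headerStarts holds the offset of each block header in the
    // underlying (compressed) stream, in ascending order.
    static List<Integer> blocksOwned(long[] headerStarts, long start, long end,
                                     boolean useHeaderStart) {
        List<Integer> owned = new ArrayList<>();
        for (int i = 0; i < headerStarts.length; i++) {
            long h = headerStarts[i];
            if (h < start) {
                continue; // the reader seeks to the first header at or after its start
            }
            // Old behaviour: report the raw stream position, which is already past
            // the header once the header has been read.
            // Patched behaviour: report the start of the current block header.
            long reported = useHeaderStart ? h : h + HEADER_LEN;
            if (reported >= end) {
                break; // the reader believes it has moved into the next slice
            }
            owned.add(i);
        }
        return owned;
    }

    public static void main(String[] args) {
        long[] headers = {0, 98, 200}; // block 1's header straddles the boundary at 100
        for (boolean fixed : new boolean[] {false, true}) {
            System.out.println((fixed ? "header-start position: " : "raw stream position: ")
                + blocksOwned(headers, 0, 100, fixed) + " "
                + blocksOwned(headers, 100, 300, fixed));
        }
    }
}
```

With headers at 0, 98, and 200 and slices [0, 100) and [100, 300), the raw 
stream position leaves block 1 unclaimed by either slice, while the 
header-start position gives every block exactly one owner.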

> Large BZip files  Seem to loose data in Pig
> -------------------------------------------
>
>                 Key: PIG-570
>                 URL: https://issues.apache.org/jira/browse/PIG-570
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch, 0.0.0, 0.1.0, site
>         Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
>            Reporter: Alex Newman
>             Fix For: types_branch, 0.0.0, 0.1.0, site
>
>         Attachments: PIG-570.patch
>
>
> So I don't believe bzip2 input to pig is working, at least not with large 
> files. It seems as though map files are getting cut off. The maps complete 
> way too quickly, and the actual row of data that pig tries to process often 
> gets cut at a random point and becomes incomplete. Here are my symptoms:
> - Maps seem to be completing at an unbelievably fast rate
> With uncompressed data:
> Status: Succeeded
> Started at: Wed Dec 17 21:31:10 EST 2008
> Finished at: Wed Dec 17 22:42:09 EST 2008
> Finished in: 1hrs, 10mins, 59sec
> Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
> map     100.00%     4670       0        0        4670      0       0 / 21
> reduce  57.72%      13         0        0        13        0       0 / 4
> With bzip compressed data:
> Started at: Wed Dec 17 21:17:28 EST 2008
> Failed at: Wed Dec 17 21:17:52 EST 2008
> Failed in: 24sec
> Black-listed TaskTrackers: 2
> Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
> map     100.00%     183        0        0        15        168     54 / 22
> reduce  100.00%     13         0        0        0         13      0 / 0
> The errors we get:
> java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec        A, 0HAW, CHIX, )
>       at org.apache.pig.data.Tuple.getField(Tuple.java:176)
>       at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
>       at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
>       at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
>       at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
>       at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
>       at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>       at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
> Last 4KB
> attempt_200812161759_0045_m_000007_0  task_200812161759_0045_m_000007  tsdhb06.factset.com     FAILED
> java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec       A, CSGN, VTX, VTX, 0, 20080303, 90919, 380, 1543, 206002)
>       at org.apache.pig.data.Tuple.getField(Tuple.java:176)
>       at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
>       at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
>       at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
>       at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
>       at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
>       at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>       at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
