A similar thing existed with PigStorage, IIRC (at least the last time I checked it a while back - unless I missed something) ... If the record boundary aligned with the HDFS block boundary, the subsequent record would get dropped by Pig.

To illustrate: map1 would read until the end of its block or the last record boundary, whichever comes last. map2 would assume a partial read by map1 and scan forward to the next record delimiter in its block, reading from there on. Hence, if map1's last record boundary and the end of the HDFS block coincide, map2 ends up skipping the first record of its block.
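That double-skip scenario can be sketched with a toy split reader (hypothetical code for illustration, not the actual PigStorage implementation). Hadoop's line reader avoids the problem by backing up one byte before discarding the presumed-partial first record; a reader that blindly scans forward to the next delimiter drops a record exactly when the previous split ended on a record boundary:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model of split reading at record/block boundaries (hypothetical).
public class SplitBoundaryDemo {
    /**
     * Reads the records of split [start, end). A record that starts before
     * `end` is read to completion, even past `end` (the "straddler"), which
     * is why later splits discard their first record. If backUpOneByte is
     * true, a split with start > 0 backs up one byte before discarding it;
     * otherwise it blindly scans forward to the next delimiter.
     */
    static List<String> readSplit(byte[] data, int start, int end, boolean backUpOneByte) {
        int pos = start;
        if (start > 0) {
            // Discard the record the previous split is assumed to have finished.
            int i = backUpOneByte ? start - 1 : start;
            while (i < data.length && data[i] != '\n') i++;
            pos = Math.min(i + 1, data.length);
        }
        List<String> out = new ArrayList<>();
        while (pos < end && pos < data.length) {
            int i = pos;
            while (i < data.length && data[i] != '\n') i++;
            out.add(new String(data, pos, i - pos, StandardCharsets.US_ASCII));
            pos = Math.min(i + 1, data.length);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "aaaa\nbbbb\ncccc\ndddd\n".getBytes(StandardCharsets.US_ASCII);
        // Split boundary at byte 10, exactly after "bbbb\n" - the bad case.
        System.out.println("split1          : " + readSplit(data, 0, 10, true));
        System.out.println("split2 (blind)  : " + readSplit(data, 10, 20, false)); // "cccc" dropped
        System.out.println("split2 (back up): " + readSplit(data, 10, 20, true));  // "cccc" kept
    }
}
```

With a boundary mid-record the two variants behave identically; only the aligned case diverges, which is why the bug surfaces so rarely.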

Not sure if a similar thing is happening here.

Regards,
Mridul

Benjamin Reed (JIRA) wrote:
     [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------

    Attachment: PIG-570.patch

I believe the problem is due to bad position tracking. In the current version of the code, we chop up the input into blocks, but unfortunately when using bzip there are bzip block boundaries, HDFS block boundaries, and record boundaries. If the bzip block boundaries line up too closely, a record could get skipped or possibly corrupted.

I was able to reproduce a problem; hopefully it is the same as your problem in the attached test case.

The root cause turns out to be improper tracking of "position". If we blindly use the position of the underlying stream and a bzip block and an HDFS block line up, we may think that we have read the first record of the next slice when in fact we have only read the bzip block header.

The attached patch fixes the problem by defining the position of the stream as the position of the start of the current block header in the underlying stream.
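The position mix-up can be seen in a toy model (hypothetical classes and sizes, not the actual Pig bzip input stream): a decompressor typically consumes the next block's header as soon as the current block's payload is exhausted, so the raw stream offset overshoots a slice boundary even though no record from the next slice has been produced. Pinning the reported position to the start of the current block's header, as the patch description says, keeps it exactly at the boundary:

```java
// Toy model of PIG-570's position-tracking fix (hypothetical code).
public class ToyBlockStream {
    static final int HEADER = 4; // bytes per block header (made-up size)
    static final int REC = 10;   // bytes per record (fixed, for simplicity)

    private final int[] recordsPerBlock;
    private long rawPos = 0;        // bytes consumed from the underlying stream
    private long blockStartPos = 0; // where the current block's header began
    private int block = 0, recInBlock = 0;

    ToyBlockStream(int... recordsPerBlock) {
        this.recordsPerBlock = recordsPerBlock;
        readHeader(); // consume the first block's header up front
    }

    private void readHeader() {
        blockStartPos = rawPos;
        rawPos += HEADER;
    }

    /** Returns the next record id, or -1 at end of stream. */
    int readRecord() {
        if (block >= recordsPerBlock.length) return -1;
        int id = block * 100 + recInBlock;
        rawPos += REC;
        recInBlock++;
        if (recInBlock == recordsPerBlock[block]) {
            block++;
            recInBlock = 0;
            // Read-ahead: the decompressor pulls in the next block's header
            // immediately, before anyone asks for the next record.
            if (block < recordsPerBlock.length) readHeader();
        }
        return id;
    }

    long rawPosition()   { return rawPos; }        // naive definition
    long blockPosition() { return blockStartPos; } // the patch's definition

    public static void main(String[] args) {
        // Two blocks of 3 records; a slice boundary falls exactly at the
        // start of block 2's header: 4 + 3*10 = 34.
        ToyBlockStream s = new ToyBlockStream(3, 3);
        for (int i = 0; i < 3; i++) s.readRecord(); // read all of block 1
        System.out.println("raw position  : " + s.rawPosition());   // 38: past the boundary
        System.out.println("block position: " + s.blockPosition()); // 34: exactly at it
    }
}
```

With the raw offset (38), a slice ending at byte 34 looks as though it has already consumed the first record of the next slice, when in fact only the block header has been read; the block-start position (34) sits exactly on the boundary, so the slice logic can tell that nothing belonging to the next slice has been emitted.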

Large BZip files seem to lose data in Pig
-----------------------------------------

                Key: PIG-570
                URL: https://issues.apache.org/jira/browse/PIG-570
            Project: Pig
         Issue Type: Bug
   Affects Versions: types_branch, 0.0.0, 0.1.0, site
        Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
           Reporter: Alex Newman
            Fix For: types_branch, 0.0.0, 0.1.0, site

        Attachments: PIG-570.patch


So I don't believe bzip2 input to Pig is working, at least not with large files. It seems as though map files are getting cut off. The maps complete way too quickly, and the actual row of data that Pig tries to process often gets randomly cut and becomes incomplete. Here are my symptoms:
- Maps seem to be completing at an unbelievably fast rate
With uncompressed data:
Status: Succeeded
Started at: Wed Dec 17 21:31:10 EST 2008
Finished at: Wed Dec 17 22:42:09 EST 2008
Finished in: 1hrs, 10mins, 59sec

Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed
map     100.00%     4670       0        0        4670      0       0 / 21
reduce  57.72%      13         0        0        13        0       0 / 4

With bzip-compressed data:
Started at: Wed Dec 17 21:17:28 EST 2008
Failed at: Wed Dec 17 21:17:52 EST 2008
Failed in: 24sec
Black-listed TaskTrackers: 2

Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed
map     100.00%     183        0        0        15        168     54 / 22
reduce  100.00%     13         0        0        0         13      0 / 0
The errors we get:
java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec  A, 0HAW, CHIX, )
        at org.apache.pig.data.Tuple.getField(Tuple.java:176)
        at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
        at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
        at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
        at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
        at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
        at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
Last 4KB
attempt_200812161759_0045_m_000007_0    task_200812161759_0045_m_000007    tsdhb06.factset.com    FAILED
java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec A, CSGN, VTX, VTX, 0, 20080303, 90919, 380, 1543, 206002)
        at org.apache.pig.data.Tuple.getField(Tuple.java:176)
        at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
        at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
        at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
        at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
        at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
        at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

