[ https://issues.apache.org/jira/browse/HADOOP-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12610458#action_12610458 ]

Sharad Agarwal commented on HADOOP-3666:
----------------------------------------

{quote}One problem with his proposal is how will the RecordReader differentiate 
between the first and second next() call?{quote}
Does the RecordReader really need to differentiate? When next() is called the 
first time, the RecordReader would skip to the next sane record boundary 
BEFORE throwing an exception (a subclass of IOException, something like 
SkippedRecordException), so that the framework knows the record has been 
skipped. Calling next() again would then read the next record.
This way we are also not forcing all RecordReaders to implement this feature. 
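
A minimal sketch of that flow for a SequenceFile-backed reader (the SkippedRecordException class, the wrapper class name, and the exact resync call are assumptions for illustration, not existing API):

{code:java}
import java.io.IOException;

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;

// Hypothetical exception from the proposal above; not an existing Hadoop class.
class SkippedRecordException extends IOException {
  public SkippedRecordException(long syncPos) {
    super("Skipped a corrupt record; resynchronized at position " + syncPos);
  }
}

// Illustrative reader showing where the skip would happen.
class SkippingSequenceFileRecordReader {
  private final SequenceFile.Reader in;

  SkippingSequenceFileRecordReader(SequenceFile.Reader in) {
    this.in = in;
  }

  public boolean next(Writable key, Writable value) throws IOException {
    try {
      return in.next(key, value);     // normal read path
    } catch (Exception e) {           // e.g. the NegativeArraySizeException below
      // Skip to the next sync point FIRST, so the reader is positioned at a
      // sane record boundary before the framework sees the failure.
      in.sync(in.getPosition());
      // Then signal the skip; the next call to next() simply reads the
      // following record, so the reader needs no extra state to tell the
      // first and second calls apart.
      throw new SkippedRecordException(in.getPosition());
    }
  }
}
{code}

On the framework side, the read loop would just catch the exception, count the skip, and call next() again (names here are illustrative):

{code:java}
while (true) {
  try {
    if (!reader.next(key, value)) break;   // end of split
    mapper.map(key, value, output, reporter);
  } catch (SkippedRecordException sre) {
    skippedRecords++;                      // record the skip and carry on
  }
}
{code}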

{quote}One simple method to integrate this with the policy framework would be 
for the RecordReader to export an error counter (as an additional interface).
{quote}
That would be the better way. But given that a lot of user code may get 
impacted, we should try to avoid an interface change as far as possible.
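
For comparison, the counter-based alternative would amount to something like this additional interface (name and method invented for illustration), which every RecordReader wanting skip support would have to implement:

{code:java}
// Hypothetical interface for the counter-based alternative; the skipping
// policy framework would poll it after reads to decide what to do.
public interface ErrorCounting {
  /** Number of corrupt records this reader has skipped so far. */
  long getSkippedRecordCount();
}
{code}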

Makes sense?

> SequenceFile RecordReader should skip bad records
> -------------------------------------------------
>
>                 Key: HADOOP-3666
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3666
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Joydeep Sen Sarma
>
> Currently a bad record in a SequenceFile leads to the entire job failing. 
> The best workaround is to skip the errant file manually (by looking at which 
> map task failed). This is a sucky option because it's manual and because one 
> should be able to skip a single SequenceFile block (instead of the entire file).
> While we don't see this often (and I don't know why this corruption happened) 
> - here's an example stack:
> Status : FAILED java.lang.NegativeArraySizeException
>       at org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:96)
>       at org.apache.hadoop.io.BytesWritable.setSize(BytesWritable.java:75)
>       at org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:130)
>       at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1640)
>       at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1712)
>       at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:79)
>       at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:176)
> Ideally the RecordReader should just skip the entire chunk if it gets an 
> unrecoverable error while reading.
> This was the consensus in HADOOP-153 as well (that data corruptions should be 
> handled by RecordReaders), and HADOOP-3144 did something similar for 
> TextInputFormat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
