SequenceFile RecordReader should skip bad records
-------------------------------------------------
Key: HADOOP-3666
URL: https://issues.apache.org/jira/browse/HADOOP-3666
Project: Hadoop Core
Issue Type: Bug
Components: mapred
Affects Versions: 0.17.0
Reporter: Joydeep Sen Sarma
Currently a bad record in a SequenceFile causes the entire job to fail. The
best workaround is to skip the errant file manually (by looking at which map
task failed). This is a poor option because it's manual and because one should
be able to skip a single SequenceFile block (instead of the entire file) -- see
the sketch below.
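For context, the manual workaround amounts to enumerating the input files
yourself and leaving out the file the failed task was reading. A minimal
sketch, assuming a Hadoop version where FileInputFormat.addInputPath and
FileSystem.listStatus are available (the method and parameter names here are
illustrative, not from the original report):

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class ExcludeBadInput {
  // Add every file under inputDir as job input, except the one known to be corrupt.
  public static void addAllButBadFile(JobConf job, Path inputDir, String badFileName)
      throws IOException {
    FileSystem fs = inputDir.getFileSystem(job);
    for (FileStatus stat : fs.listStatus(inputDir)) {
      if (!stat.getPath().getName().equals(badFileName)) {
        FileInputFormat.addInputPath(job, stat.getPath());
      }
    }
  }
}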
While we don't see this often (and I don't know why this corruption happened),
here's an example stack:
Status : FAILED java.lang.NegativeArraySizeException
at org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:96)
at org.apache.hadoop.io.BytesWritable.setSize(BytesWritable.java:75)
at org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:130)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1640)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1712)
at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:79)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:176)
Ideally the RecordReader should just skip the entire chunk if it gets an
unrecoverable error while reading.
This was the consensus in HADOOP-153 as well (that data corruption should be
handled by RecordReaders), and HADOOP-3144 did something similar for
TextInputFormat.
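To illustrate the idea, here is a minimal sketch of the proposed behaviour --
not the actual SequenceFileRecordReader change -- assuming BytesWritable keys
and values (as in the stack above): on a read error, remember the current
position, seek past it to the next sync marker with SequenceFile.Reader.sync(),
and keep reading.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;

public class SkippingSequenceFileDump {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);

    BytesWritable key = new BytesWritable();
    BytesWritable value = new BytesWritable();
    long good = 0, skipped = 0;
    boolean more = true;
    try {
      while (more) {
        long pos = reader.getPosition();
        try {
          more = reader.next(key, value);
          if (more) {
            good++;
          }
        } catch (Exception e) {   // e.g. NegativeArraySizeException from a corrupt length field
          skipped++;
          System.err.println("bad chunk at position " + pos + ": " + e);
          reader.sync(pos);       // seek to the first sync marker past the bad position
          if (reader.getPosition() <= pos) {
            break;                // no forward progress possible; give up
          }
        }
      }
    } finally {
      reader.close();
    }
    System.out.println(good + " records read, " + skipped + " bad chunks skipped");
  }
}

A real fix inside the RecordReader would presumably do the same re-sync in
next(), stop at the split boundary, and report the skipped chunks through a
counter rather than stderr.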