avoid bzip2 decompressor throwing exception on corrupted (prematurely
truncated) file
-------------------------------------------------------------------------------------
Key: HADOOP-3898
URL: https://issues.apache.org/jira/browse/HADOOP-3898
Project: Hadoop Core
Issue Type: Improvement
Components: mapred
Affects Versions: 0.17.1
Reporter: Suhas Gogate
When running a map-reduce streaming job on bzip2-compressed input, the job
fails with one of the two Java exceptions shown in the stack traces below.
This seems to happen when one of the bz2 input files is corrupted (probably
because the file was prematurely truncated). An example command follows the
stack traces.
Can we fix the bzip2 decompressor so that it does not throw these two
exceptions?
2008-07-16 07:23:39,605 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: mark/reset not supported
    at java.io.InputStream.reset(InputStream.java:334)
    at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.readLine(Bzip2TextInputFormat.java:117)
    at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:140)
    at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:34)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:158)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
or
2008-07-16 20:49:28,020 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: CRC error
    at org.apache.tools.bzip2r.CBZip2InputStream.cadvise(CBZip2InputStream.java:74)
    at org.apache.tools.bzip2r.CBZip2InputStream.crcError(CBZip2InputStream.java:378)
    at org.apache.tools.bzip2r.CBZip2InputStream.endBlock(CBZip2InputStream.java:351)
    at org.apache.tools.bzip2r.CBZip2InputStream.setupNoRandPartA(CBZip2InputStream.java:851)
    at org.apache.tools.bzip2r.CBZip2InputStream.setupNoRandPartB(CBZip2InputStream.java:903)
    at org.apache.tools.bzip2r.CBZip2InputStream.read(CBZip2InputStream.java:240)
    at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.readLine(Bzip2TextInputFormat.java:102)
    at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:140)
    at org.apache.hadoop.mapred.Bzip2TextInputFormat$BZip2LineRecordReader.next(Bzip2TextInputFormat.java:34)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:158)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
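The first trace ends in java.io.InputStream.reset, whose base-class implementation always throws "mark/reset not supported". The following is only a sketch of that JDK behavior and of one possible guard, not the Bzip2TextInputFormat code: NoMarkStream, ensureMarkSupported and peek are hypothetical names, and NoMarkStream merely stands in for a decompressor stream that does not override mark/reset.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class MarkResetSketch {

    /** Hypothetical stand-in for a stream that inherits InputStream's default mark/reset. */
    static class NoMarkStream extends InputStream {
        private final InputStream in;

        NoMarkStream(InputStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }
    }

    /** Wraps the stream in a BufferedInputStream unless it already supports mark/reset. */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Looks at the next byte without consuming it; this only works with mark support. */
    static int peek(InputStream in) throws IOException {
        in.mark(1);
        int b = in.read();
        in.reset(); // the base InputStream.reset() throws IOException here
        return b;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "a\r\nb".getBytes(StandardCharsets.US_ASCII);

        InputStream raw = new NoMarkStream(new ByteArrayInputStream(data));
        try {
            peek(raw);
        } catch (IOException e) {
            System.out.println("raw stream failed: " + e.getMessage());
        }

        InputStream safe = ensureMarkSupported(new NoMarkStream(new ByteArrayInputStream(data)));
        System.out.println("peeked: " + (char) peek(safe));
    }
}
```

Checking markSupported() before relying on reset() would avoid this particular exception, but whether the record reader needs such a guard depends on code not shown in this report.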
Example:
$HADOOP_HOME/bin/hadoop jar -libjars $<path>/jars/bzip2.jar \
$HADOOP_HOME/hadoop-streaming.jar \
-inputformat org.apache.hadoop.mapred.Bzip2TextInputFormat \
-mapper "cat" \
-reducer "cat" \
-numReduceTasks 20 \
-input '<path>/corrupt-data.bz2' \
-output bzip2_bug_example \
-jobconf stream.num.map.output.key.fields=1 \
-jobconf stream.num.reduce.output.fields=1 \
-jobconf num.key.fields.for.partition=1
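One possible shape for the requested behavior, offered only as a sketch and not as the actual fix: wrap the decompressed stream so that an IOException during a read is recorded and reported as end of stream. TruncationTolerantStream, sawError and FailingStream are hypothetical names, and FailingStream only simulates a decompressor that hits a CRC error partway through the data.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class TruncationTolerantStream extends FilterInputStream {
    private boolean failed = false;

    public TruncationTolerantStream(InputStream in) {
        super(in);
    }

    /** True if an IOException was swallowed and turned into end of stream. */
    public boolean sawError() {
        return failed;
    }

    @Override
    public int read() throws IOException {
        if (failed) {
            return -1;
        }
        try {
            return super.read();
        } catch (IOException e) {
            failed = true; // remember the failure so callers can log or count it
            return -1;
        }
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (failed) {
            return -1;
        }
        try {
            return super.read(buf, off, len);
        } catch (IOException e) {
            failed = true;
            return -1;
        }
    }

    /** Hypothetical stand-in: yields a few bytes, then fails like a corrupted block. */
    static class FailingStream extends InputStream {
        private int remaining;

        FailingStream(int goodBytes) {
            this.remaining = goodBytes;
        }

        @Override
        public int read() throws IOException {
            if (remaining-- > 0) {
                return 'x';
            }
            throw new IOException("CRC error");
        }
    }

    public static void main(String[] args) throws IOException {
        TruncationTolerantStream in = new TruncationTolerantStream(new FailingStream(3));
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        System.out.println("bytes read: " + count + ", error seen: " + in.sawError());
    }
}
```

Swallowing the exception means a job would silently process partial input, so recording the failure (as sawError does here) and surfacing it in logs or counters seems advisable if an approach like this is taken.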