Running a job on my 5-node cluster, I get these intermittent exceptions in
my logs:

java.io.IOException: incorrect data check
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:218)
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
        at java.io.InputStream.read(InputStream.java:89)
        at org.apache.hadoop.mapred.LineRecordReader$LineReader.backfill(LineRecordReader.java:88)
        at org.apache.hadoop.mapred.LineRecordReader$LineReader.readLine(LineRecordReader.java:114)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:215)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:37)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)


They occur across all the nodes, but I can't figure out which file is
causing the problem. I'm working on the assumption that it's a specific
file, because it's precisely the same error on each node, but I've scoured
the logs and can't find any reference to which file caused the hiccup, and
it's making the job fail. The input is 720 .gz files, roughly 100 MB each,
and other files are processed on every node without a problem. I'm in the
middle of testing the .gz files, but I don't think the problem is
necessarily in the source data so much as in how I copied it into HDFS.

So my questions are these:

1. Is this a known issue?
2. Is there some way to determine which file or files are causing these
exceptions? (See the first sketch below for the kind of thing I mean.)
3. Is there a way to run something like "gzip -t blah.gz" on a file that is
already in HDFS, or maybe verify a checksum? (Second sketch below.)
4. Is there a reason other than a corrupt data file that could cause this?
5. The original MapReduce paper describes a mechanism for skipping records
that cause problems. Is there a way to have Hadoop skip these problematic
files and their records and continue with the job? (Third sketch below.)
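
On question 2, the kind of thing I have in mind is logging which file each
map task is reading, so that when an attempt fails the file name is sitting
in that task's log. If I've understood the old mapred API correctly, the
framework sets "map.input.file" in the JobConf for file-based splits, so a
mapper can record it in configure() before any records are read. The class
name here is just illustrative and the map body is elided; this is a
sketch, not something I've run:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch: log the input file name for every map attempt, so a failed
// attempt's stderr (or its status string) tells you which .gz it was reading.
public class FileLoggingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String inputFile = "unknown";

  public void configure(JobConf conf) {
    // "map.input.file" is set by the framework for file-based input splits.
    inputFile = conf.get("map.input.file", "unknown");
    System.err.println("processing input file: " + inputFile);
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    reporter.setStatus("reading " + inputFile);
    // ... actual map logic would go here ...
  }
}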
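
On question 3, my first thought was to stream each file out and test it,
something along the lines of "hadoop fs -cat /path/blah.gz | gzip -t"
(gzip -t will test what it reads on stdin). The Java sketch below does the
equivalent check by draining each file through GZIPInputStream, which forces
the CRC/length check at end of stream, and prints one line per file, which
seems easier to run across all 720 files. Again just a sketch with an
illustrative class name:

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: rough equivalent of "gzip -t" for files already in HDFS.
public class HdfsGzipTest {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    byte[] buf = new byte[64 * 1024];
    for (String arg : args) {
      Path path = new Path(arg);
      InputStream in = null;
      try {
        in = new GZIPInputStream(fs.open(path));
        // Drain the stream; a corrupt member throws IOException before EOF.
        while (in.read(buf) != -1) {
        }
        System.out.println("OK      " + path);
      } catch (IOException e) {
        System.out.println("CORRUPT " + path + " (" + e.getMessage() + ")");
      } finally {
        if (in != null) {
          in.close();
        }
      }
    }
  }
}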
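
On question 5, my understanding is that record-level skipping wouldn't help
much here anyway, since a .gz file isn't splittable and a corrupt stream
takes out the rest of the file, not just one record. What I was picturing is
more like giving up on the bad split and letting the job continue, e.g. a
custom MapRunnable plugged in with JobConf.setMapRunnerClass(). The sketch
below is only to show the kind of thing I mean (untested, and the class name
is made up); it catches the IOException, logs the offending file, and
abandons the remainder of that split:

import java.io.IOException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch: like the default MapRunner, but a decompression failure abandons
// the current split instead of failing the task (and hence the job).
public class TolerantMapRunner<K1, V1, K2, V2>
    implements MapRunnable<K1, V1, K2, V2> {

  private Mapper<K1, V1, K2, V2> mapper;
  private JobConf job;

  @SuppressWarnings("unchecked")
  public void configure(JobConf job) {
    this.job = job;
    this.mapper = (Mapper<K1, V1, K2, V2>)
        ReflectionUtils.newInstance(job.getMapperClass(), job);
  }

  public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
      Reporter reporter) throws IOException {
    try {
      K1 key = input.createKey();
      V1 value = input.createValue();
      while (input.next(key, value)) {
        mapper.map(key, value, output, reporter);
      }
    } catch (IOException e) {
      // Log which file blew up and move on; the rest of this split is lost.
      System.err.println("abandoning split on " + job.get("map.input.file")
          + ": " + e.getMessage());
    } finally {
      mapper.close();
      input.close();
    }
  }
}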


Thanks,
Colin
