running a job on my 5-node cluster, i get these intermittent exceptions in my logs:
    java.io.IOException: incorrect data check
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:218)
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
        at java.io.InputStream.read(InputStream.java:89)
        at org.apache.hadoop.mapred.LineRecordReader$LineReader.backfill(LineRecordReader.java:88)
        at org.apache.hadoop.mapred.LineRecordReader$LineReader.readLine(LineRecordReader.java:114)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:215)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:37)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)

the exceptions occur across all the nodes, but i can't figure out which file is causing the problem. i'm working on the assumption that it's a specific file, because it's precisely the same error on every node, but i've scoured the logs and can't find any reference to which file caused the hiccup. the input is 720 .gz files, ~100mb each; other files are processed on every node without a problem, yet this is enough to make the whole job fail. i'm in the middle of testing the .gz files, but i don't think the problem is necessarily in the source data so much as something introduced when i copied it into hdfs.

so my questions are these:

1. is this a known issue?
2. is there some way to determine which file or files are causing these exceptions? (a rough sketch of what i was considering is in the p.s. below.)
3. is there a way to run something like "gzip -t blah.gz" on a file in hdfs, or maybe verify a checksum? (again, see the p.s.)
4. is there a reason other than a corrupt data file that would cause this?
5. in the original mapreduce paper they describe a mechanism for skipping records that cause problems. is there a way to have hadoop skip these problematic files and their records and continue with the job? (sketch in the p.s.)

thanks, colin
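
p.s. here are the rough sketches i mentioned, in case they make the questions clearer. none of this is tested, and the class/variable names are just made up for illustration.

for question 2, my thought was to log the split's file name from inside the mapper, on the assumption that map.input.file is set for FileSplits in the old mapred api i'm using. since configure() runs before any records are read, the failed attempt's stderr log should at least name its file even though the decompressor dies before map() is ever called:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LoggingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private String inputFile = "unknown";

      @Override
      public void configure(JobConf job) {
        // map.input.file is set per task for FileSplits; logging it up front
        // means the failed attempt's log names its file even if decompression
        // blows up before the first record arrives
        inputFile = job.get("map.input.file", "unknown");
        System.err.println("processing input file: " + inputFile);
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // ... existing map logic ...
      }
    }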
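
for question 3, i suppose i could pipe hadoop fs -cat through gzip -t one file at a time, but something like the following might check all 720 files in one pass. i'm relying on GZIPInputStream validating the gzip crc when it reaches the end of the stream, so a corrupt or truncated file should throw:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // decompresses every .gz file under the given hdfs directory to nowhere;
    // roughly the moral equivalent of running "gzip -t" on each one
    public class HdfsGzipCheck {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        byte[] buf = new byte[64 * 1024];
        for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
          Path p = stat.getPath();
          if (!p.getName().endsWith(".gz")) continue;
          InputStream in = null;
          try {
            in = new GZIPInputStream(fs.open(p));
            while (in.read(buf) != -1) { /* just drain the stream */ }
            System.out.println("OK      " + p);
          } catch (IOException e) {
            System.out.println("CORRUPT " + p + " (" + e.getMessage() + ")");
          } finally {
            if (in != null) in.close();
          }
        }
      }
    }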
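
for question 5, from skimming the api docs it looks like newer hadoop versions (0.19-ish, i think) have a SkipBadRecords helper, though i haven't tried it and i'm not sure record skipping even helps once a gzip stream goes bad mid-file. the kind of setup i had in mind is something like:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;

    public class SkippingJobSetup {
      // hypothetical job setup; the SkipBadRecords lines are the only point here
      public static JobConf configure(JobConf conf) {
        // after 2 failed attempts of a task, re-run it in skip mode
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
        // tolerate throwing away up to 1000 records around the bad one
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1000L);
        return conf;
      }
    }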