I tried a somewhat naive version of this using streaming, and it failed miserably. I went with:

bin/hadoop jar ./contrib/streaming/hadoop-0.16.1-streaming.jar -input views -output md5out -mapper org.apache.hadoop.mapred.lib.IdentityMapper -reducer "md5sum -b -"

...but I think that's the wrong semantic. The input directory is a bunch of .gz files. Are they passed to the reducer (md5sum) whole, or decompressed first? Are they passed in on stdin? Is there a way to ensure each one arrives as a complete file? Would I need to write my own InputFormat, maybe extending FileInputFormat, to ensure the files aren't decompressed or split?
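If so, I'm picturing something like the untested sketch below, written against the old org.apache.hadoop.mapred API (the class names are mine, and I haven't double-checked the exact 0.16 signatures): isSplitable() returns false so files are never split, and the record reader hands each map task a single record whose value is the raw, still-compressed bytes of the file.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Untested sketch: one whole, undecompressed file per record.
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    // Never split a file, no matter how large.
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }

    public RecordReader<Text, BytesWritable> getRecordReader(InputSplit split,
            JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }

    // Emits one record per split: key = file path, value = raw file bytes.
    // Assumes each file fits comfortably in memory (~100MB here).
    static class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
        private final FileSplit split;
        private final JobConf job;
        private boolean done = false;

        WholeFileRecordReader(FileSplit split, JobConf job) {
            this.split = split;
            this.job = job;
        }

        public boolean next(Text key, BytesWritable value) throws IOException {
            if (done) {
                return false;
            }
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(job);
            byte[] contents = new byte[(int) split.getLength()];
            FSDataInputStream in = fs.open(file);
            try {
                in.readFully(contents);   // raw bytes, no codec involved
            } finally {
                in.close();
            }
            key.set(file.toString());
            value.set(contents, 0, contents.length);
            done = true;
            return true;
        }

        public Text createKey() { return new Text(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos() { return done ? split.getLength() : 0; }
        public float getProgress() { return done ? 1.0f : 0.0f; }
        public void close() throws IOException { }
    }
}

A regular (non-streaming) job could then md5 the value bytes in the map, one sum per file; whether streaming would pass those raw bytes through stdin cleanly is exactly the part I'm unsure about.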
-colin

On Tue, Apr 8, 2008 at 6:15 PM, Norbert Burger <[EMAIL PROTECTED]> wrote:

> Colin, how about writing a streaming mapper which simply runs md5sum
> on each file it gets as input? Run this task along with the identity
> reducer, and you should be able to identify pretty quickly if there's
> an HDFS corruption issue.
>
> Norbert
>
> On Tue, Apr 8, 2008 at 5:50 PM, Colin Freas <[EMAIL PROTECTED]> wrote:
>
> > so, in an attempt to track down this problem, i've stripped out most
> > of the files for input, trying to identify which ones are causing
> > the problem.
> >
> > i've narrowed it down, but i can't pinpoint it. i keep getting these
> > incorrect data check errors below, but the .gz files test fine with
> > gzip.
> >
> > is there some way to run an md5 or something on the files in hdfs
> > and compare it to the checksum of the files on my local machine?
> >
> > i've looked around the lists and through the various options to send
> > to .../bin/hadoop, but nothing is jumping out at me.
> >
> > this is particularly frustrating because it's causing my jobs to
> > fail rather than skipping the problematic input files. i've also
> > looked through the conf file and don't see anything about skipping
> > bad files without killing the job.
> >
> > -colin
> >
> > On Tue, Apr 8, 2008 at 11:53 AM, Colin Freas <[EMAIL PROTECTED]> wrote:
> >
> > > running a job on my 5 node cluster, i get these intermittent
> > > exceptions in my logs:
> > >
> > > java.io.IOException: incorrect data check
> > >   at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
> > >   at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:218)
> > >   at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
> > >   at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
> > >   at java.io.InputStream.read(InputStream.java:89)
> > >   at org.apache.hadoop.mapred.LineRecordReader$LineReader.backfill(LineRecordReader.java:88)
> > >   at org.apache.hadoop.mapred.LineRecordReader$LineReader.readLine(LineRecordReader.java:114)
> > >   at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:215)
> > >   at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:37)
> > >   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
> > >   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> > >   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
> > >   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)
> > >
> > > they occur across all the nodes, but i can't figure out which file
> > > is causing the problem. i'm working on the assumption it's a
> > > specific file because it's precisely the same error that occurs on
> > > each node. i've scoured the logs and can't find any reference to
> > > which file caused the hiccup, but this is causing the job to fail.
> > > other files are processed on each node without a problem. the
> > > files are 720 .gz files, ~100MB each. i'm in the middle of testing
> > > the .gz files, but i don't think the problem is necessarily in the
> > > source data, as much as in when i copied it into hdfs.
> > >
> > > so my questions are these:
> > > is this a known issue?
> > > is there some way to determine which file or files are causing
> > > these exceptions?
> > > is there a way to run something like "gzip -t blah.gz" on the file
> > > in hdfs? or maybe a checksum?
> > > is there a reason other than a corrupt data file that would be
> > > causing this?
> > > in the original mapreduce paper, they talk about a mechanism to
> > > skip records that cause problems. is there a way to have hadoop
> > > skip these problematic files and the associated records and
> > > continue with the job?
> > >
> > > thanks,
> > > colin
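PS: for the md5 / "gzip -t" questions quoted above, would something as simple as this work, skipping MapReduce entirely? (file names below are placeholders)

bin/hadoop dfs -cat views/part-0001.gz | md5sum
md5sum /local/copy/views/part-0001.gz

bin/hadoop dfs -cat views/part-0001.gz | gzip -t

My understanding is that "dfs -cat" streams the stored bytes of the file without decompressing them, and that HDFS verifies its own checksums on read, so if the sums match the local copies and gzip -t stays quiet, the data sitting in HDFS should be intact and the problem is more likely in the decompression path during the job.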
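PPS: on skipping bad input rather than failing the whole job: I haven't found a per-record skip like the one described in the MapReduce paper, but JobConf appears to have a knob for tolerating a percentage of failed map tasks (I haven't confirmed which release it first shows up in):

// Sketch: let up to 5% of map tasks fail without failing the job.
JobConf conf = new JobConf(MyJob.class);   // MyJob is a placeholder
conf.setMaxMapTaskFailuresPercent(5);

For streaming, the equivalent should be something like -jobconf mapred.max.map.failures.percent=5 on the command line.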