I tried a somewhat naive version of this using streaming, and it failed
miserably.

I went with:

bin/hadoop jar ./contrib/streaming/hadoop-0.16.1-streaming.jar -input views
-output md5out -mapper org.apache.hadoop.mapred.lib.IdentityMapper -reducer
"md5sum -b -"

...but I think those are the wrong semantics.

The input directory is a bunch of gz files.  Are they passed to the reducer
(md5sum) as a whole, or are they decompressed and passed?  Are they passed
in on stdin?

Is there a way to ensure they're passed as a complete file?  Would I need to
write my own InputFormat handler, maybe extending FileInputFormat to ensure
the files aren't decompressed or split?
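
Here's roughly what I'm picturing for that last option, in case it helps: a
completely untested sketch against my reading of the 0.16 mapred API (the
WholeFileInputFormat name and the exact signatures are my own guesses, so
treat it as pseudocode), where each file is handed to a map task as one
record of raw, still-compressed bytes:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Untested sketch (class name is mine): each input file becomes exactly one
// record, key = file path, value = the file's raw bytes, untouched by any codec.
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

  // never split, so a .gz is handed to a single map task in one piece
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }

  public RecordReader<Text, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  static class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
    private final FileSplit split;
    private final JobConf conf;
    private boolean done = false;

    WholeFileRecordReader(FileSplit split, JobConf conf) {
      this.split = split;
      this.conf = conf;
    }

    public boolean next(Text key, BytesWritable value) throws IOException {
      if (done) {
        return false;
      }
      Path path = split.getPath();
      byte[] contents = new byte[(int) split.getLength()];
      FSDataInputStream in = path.getFileSystem(conf).open(path);
      try {
        in.readFully(0, contents);   // raw bytes straight off HDFS, no decompression
      } finally {
        in.close();
      }
      key.set(path.toString());
      value.set(contents, 0, contents.length);
      done = true;
      return true;
    }

    public Text createKey() { return new Text(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return done ? split.getLength() : 0; }
    public float getProgress() { return done ? 1.0f : 0.0f; }
    public void close() { }
  }
}

If something like that works, a trivial map could run java.security.MessageDigest
over the value and emit path + MD5, which I could then diff against md5sum run
on the local copies.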



-colin


On Tue, Apr 8, 2008 at 6:15 PM, Norbert Burger <[EMAIL PROTECTED]>
wrote:

> Colin, how about writing a streaming mapper which simply runs md5sum on each
> file it gets as input?  Run this task along with the identity reducer, and
> you should be able to identify pretty quickly if there's an HDFS corruption
> issue.
>
> Norbert
>
> On Tue, Apr 8, 2008 at 5:50 PM, Colin Freas <[EMAIL PROTECTED]> wrote:
>
> > so, in an attempt to track down this problem, i've stripped out most of the
> > files for input, trying to identify which ones are causing the problem.
> >
> > i've narrowed it down, but i can't pinpoint it.  i keep getting these
> > incorrect data check errors below, but the .gz files test fine with gzip.
> >
> > is there some way to run an md5 or something on the files in hdfs and
> > compare it to the checksum of the files on my local machine?
> >
> > i've looked around the lists and through the various options to send to
> > .../bin/hadoop, but nothing is jumping out at me.
> >
> > this is particularly frustrating because it's causing my jobs to fail,
> > rather than skipping the problematic input files.  i've also looked through
> > the conf file and don't see anything similar about skipping bad files
> > without killing the job.
> >
> > -colin
> >
> >
> > On Tue, Apr 8, 2008 at 11:53 AM, Colin Freas <[EMAIL PROTECTED]> wrote:
> >
> > > running a job on my 5 node cluster, i get these intermittent exceptions
> > > in my logs:
> > >
> > > java.io.IOException: incorrect data check
> > >       at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
> > >       at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:218)
> > >       at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:80)
> > >       at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
> > >       at java.io.InputStream.read(InputStream.java:89)
> > >       at org.apache.hadoop.mapred.LineRecordReader$LineReader.backfill(LineRecordReader.java:88)
> > >       at org.apache.hadoop.mapred.LineRecordReader$LineReader.readLine(LineRecordReader.java:114)
> > >       at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:215)
> > >       at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:37)
> > >       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
> > >       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> > >       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
> > >       at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)
> > >
> > > they occur across all the nodes, but i can't figure out which file is
> > > causing the problem.  i'm working on the assumption it's a specific file
> > > because it's precisely the same error that occurs on each node.  i've
> > > scoured the logs and can't find any reference to which file caused the
> > > hiccup.  but this is causing the job to fail.  other files are processed
> > > without a problem.  the files are 720 .gz files, ~100mb each.  other files
> > > are processed on each node without a problem.  i'm in the middle of
> > > testing the .gz files, but i don't think the problem is necessarily in
> > > the source data, as much as in when i copied it into hdfs.
> > >
> > > so my questions are these:
> > > is this a known issue?
> > > is there some way to determine which file or files are causing these
> > > exceptions?
> > > is there a way to run something like "gzip -t blah.gz" on the file in
> > > hdfs?  or maybe a checksum?
> > > is there a reason other than a corrupt datafile that would be causing
> > > this?
> > > in the original mapreduce paper, they talk about a mechanism to skip
> > > records that cause problems.  is there a way to have hadoop skip these
> > > problematic files and the associated records and continue with the job?
> > >
> > >
> > > thanks,
> > > colin
> > >
> >
>
