Hi Ali,

I also faced this error when I ran jobs both locally and on a cluster. I was able to solve the problem by removing the .crc files created in the input folder for the job. Please check that there are no .crc files in the input. I hope this solves the problem.
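In case it helps, here is a rough, untested sketch of what I mean. It assumes the input directory is on the local filesystem; the class name and argument handling are just for illustration, not anything from Nutch or Hadoop:

import java.io.File;

// Untested sketch: remove any leftover .crc files from a local input folder
// before running the job. Pass the input directory as the first argument.
public class RemoveCrcFiles {
    public static void main(String[] args) {
        File inputDir = new File(args[0]);
        File[] files = inputDir.listFiles();   // includes hidden files such as .part-00000.crc
        if (files == null) {
            System.err.println("Not a directory: " + inputDir);
            return;
        }
        for (File f : files) {
            if (f.getName().endsWith(".crc")) {
                System.out.println("Deleting " + f + ": " + f.delete());
            }
        }
    }
}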
Thanks,
Subbu

On Wed, May 9, 2012 at 1:31 PM, Ali Safdar Kureishy <safdar.kurei...@gmail.com> wrote:

> Hi,
>
> I've included both the Nutch and Hadoop mailing lists, since I don't know
> which of the two is the root cause of this issue, and it might be possible
> to pursue a resolution from both sides.
>
> What I'm trying to do is dump the contents of all the fetched pages from my
> Nutch crawl -- about 600K of them. I initially tried extracting this
> information from the *<segment>/parse_text* folder, but I kept receiving
> the error below, so I switched over to the *<segment>/content* folder.
> BOTH of these *consistently* give me the following ChecksumException, which
> fails the map-reduce job. At the very least I'm hoping to get some tip(s)
> on how to ignore this error and let my job complete.
>
> org.apache.hadoop.fs.ChecksumException: Checksum Error
>     at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:164)
>     at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
>     at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
>     at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
>     at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
>     at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:404)
>     at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
>     at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
>     at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
>     at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:156)
>     at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:499)
>     at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
>     at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1522)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>
> I'm using the *SequenceFileInputFormat* to read the data in each case.
>
> I have also attached the Hadoop output (checksum-error.txt). I have no idea
> how to ignore this error or how to debug it. I've tried setting the boolean
> "*io.skip.checksum.errors*" property to *true* on the MapReduce Conf
> object, but it makes no difference. The error still happens consistently,
> so it seems like I'm either not setting the right property, or it is being
> ignored by Hadoop. Since the error is thrown down in the internals of
> Hadoop, there doesn't seem to be any other way to ignore it without
> changing Hadoop code (which I'm not able to do at this point). Is this a
> problem with the data that was output by Nutch, or is this a bug in Hadoop?
> *Btw, I ran Nutch in local mode (without Hadoop), and I'm running the
> Hadoop job (below) purely as an application from Eclipse (not via the
> bin/hadoop script).*
>
> Any help or pointers on how to dig further with this would be greatly
> appreciated. If there is any other way for me to ignore these checksum
> errors and let the job complete, do please share that with me as well.
>
> Here is the code for the reader job using MapReduce:
>
> package org.q.alt.sc.nutch.readerjobs;
>
> import java.io.IOException;
>
> import org.apache.hadoop.conf.Configured;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.FileInputFormat;
> import org.apache.hadoop.mapred.FileOutputFormat;
> import org.apache.hadoop.mapred.JobClient;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.MapReduceBase;
> import org.apache.hadoop.mapred.Mapper;
> import org.apache.hadoop.mapred.OutputCollector;
> import org.apache.hadoop.mapred.Reporter;
> import org.apache.hadoop.mapred.SequenceFileInputFormat;
> import org.apache.hadoop.mapred.TextOutputFormat;
> import org.apache.hadoop.mapred.lib.IdentityReducer;
> import org.apache.hadoop.util.Tool;
> import org.apache.hadoop.util.ToolRunner;
> import org.apache.nutch.protocol.Content;
>
> public class SegmentContentReader extends Configured implements Tool {
>
>     /**
>      * @param args
>      */
>     public static void main(String[] args) throws Exception {
>         int exitCode = ToolRunner.run(new SegmentContentReader(), args);
>         System.exit(exitCode);
>     }
>
>     @Override
>     public int run(String[] args) throws Exception {
>         if (args.length != 2) {
>             System.out.printf(
>                 "Usage: %s [generic options] <input dir> <output dir>\n",
>                 getClass().getSimpleName());
>             ToolRunner.printGenericCommandUsage(System.out);
>             return -1;
>         }
>
>         JobConf conf = new JobConf(getConf(), SegmentContentReader.class);
>         conf.setBoolean("io.skip.checksum.errors", true);
>         conf.setJobName(this.getClass().getName());
>         conf.setJarByClass(SegmentContentReader.class);
>
>         FileInputFormat.addInputPath(conf, new Path(args[0]));
>         conf.setInputFormat(SequenceFileInputFormat.class);
>
>         conf.setOutputFormat(TextOutputFormat.class);
>         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>
>         conf.setMapperClass(Mapper1.class);
>         conf.setMapOutputKeyClass(Text.class);
>         conf.setMapOutputValueClass(Text.class);
>
>         conf.setReducerClass(IdentityReducer.class);
>         conf.setOutputKeyClass(Text.class);
>         conf.setOutputValueClass(Text.class);
>
>         JobClient.runJob(conf);
>         return 0;
>     }
>
>     public static class Mapper1 extends MapReduceBase implements
>             Mapper<Text, Content, Text, Text> {
>
>         @Override
>         public void map(Text key, Content value,
>                 OutputCollector<Text, Text> output, Reporter reporter)
>                 throws IOException {
>             String content = new String(value.getContent());
>             //System.out.println("Content: " + content);
>             output.collect(key, new Text(content));
>         }
>     }
> }
>
> Regards,
> Safdar
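P.S. Since you asked about other ways to get past the checksum errors: the trace shows the failure happening while the map outputs are being merged, so one thing you could try is to skip MapReduce altogether and read the segment's sequence files directly with checksum verification turned off on the local filesystem. This is only an untested sketch -- it assumes the segment is on the local filesystem, that <segment>/content/part-00000/data is the file you want to read (adjust for your layout), and the class name is just for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

// Untested sketch: dump <segment>/content records without a MapReduce job,
// with checksum verification disabled so a bad .crc file cannot fail the read.
public class DirectContentDumper {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        LocalFileSystem fs = FileSystem.getLocal(conf);
        fs.setVerifyChecksum(false);                  // do not verify against .crc files
        Path data = new Path(args[0]);                // e.g. <segment>/content/part-00000/data
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        try {
            Text key = new Text();
            Content value = new Content();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + new String(value.getContent()));
            }
        } finally {
            reader.close();
        }
    }
}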