Hi Subbu! Thanks so much for this tip. Strangely, it doesn't seem to work for me ... I still get the checksum error (though it appears to happen later on in the job).
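For reference, clearing the .crc side files from the input folder can be done with something like the sketch below (plain java.io, since everything here runs in local mode; the segment path is just illustrative):

    import java.io.File;

    public class CrcCleaner {
        public static void main(String[] args) {
            // Illustrative local segment directory -- replace with the real input path.
            File dir = new File(args.length > 0 ? args[0] : "crawl/segments/20120509/content");
            deleteCrcFiles(dir);
        }

        // Recursively delete any Hadoop checksum side files (".<name>.crc") under dir.
        private static void deleteCrcFiles(File dir) {
            File[] entries = dir.listFiles();
            if (entries == null) {
                return; // not a directory, or unreadable
            }
            for (File f : entries) {
                if (f.isDirectory()) {
                    deleteCrcFiles(f);
                } else if (f.getName().endsWith(".crc")) {
                    System.out.println("Deleting " + f + ": " + f.delete());
                }
            }
        }
    }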
Has this workaround always worked for you? I also tried using the setMaxMapperFailurePercentage() and setMaxReducerFailurePercentage() settings (set them to 20% each; sketched at the end of this message), but I still see this checksum error. Any thoughts/suggestions?

Thanks again!

Regards,
Safdar

On Wed, May 9, 2012 at 12:37 PM, Kasi Subrahmanyam <kasisubbu...@gmail.com> wrote:

> Hi Ali,
> I also faced this error when I ran the jobs, either locally or on a cluster.
> I was able to solve the problem by removing the .crc file created in the
> input folder for this job.
> Please check that there is no .crc file in the input.
> I hope this solves the problem.
>
> Thanks,
> Subbu
>
>
> On Wed, May 9, 2012 at 1:31 PM, Ali Safdar Kureishy <safdar.kurei...@gmail.com> wrote:
>
>> Hi,
>>
>> I've included both the Nutch and Hadoop mailing lists, since I don't know
>> which of the two is the root cause of this issue, and it might be possible
>> to pursue a resolution from both sides.
>>
>> What I'm trying to do is dump the contents of all the fetched pages from
>> my Nutch crawl -- about 600K of them. I initially tried extracting this
>> information from the <segment>/parse_text folder, but I kept receiving the
>> error below, so I switched over to the <segment>/content folder. BOTH of
>> these *consistently* give me the following checksum exception, which fails
>> the map-reduce job. At the very least, I'm hoping to get some tips on how
>> to ignore this error and let my job complete.
>>
>> org.apache.hadoop.fs.ChecksumException: Checksum Error
>>   at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:164)
>>   at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
>>   at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
>>   at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
>>   at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
>>   at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:404)
>>   at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
>>   at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
>>   at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
>>   at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:156)
>>   at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:499)
>>   at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
>>   at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1522)
>>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>
>> I'm using SequenceFileInputFormat to read the data in each case.
>>
>> I have also attached the Hadoop output (checksum-error.txt). I have no
>> idea how to ignore this error or how to debug it. I've tried setting the
>> boolean "io.skip.checksum.errors" property to true on the MapReduce Conf
>> object, but it makes no difference. The error still happens consistently,
>> so it seems like I'm either not setting the right property, or it is being
>> ignored by Hadoop.
>> Since the error is thrown down in the internals of Hadoop, there doesn't
>> seem to be any other way to ignore it without changing Hadoop code (which
>> I'm not able to do at this point). Is this a problem with the data that
>> was output by Nutch? Or is this a bug in Hadoop? Btw, I ran Nutch in local
>> mode (without Hadoop), and I'm running the Hadoop job (below) purely as an
>> application from Eclipse (not via the bin/hadoop script).
>>
>> Any help or pointers on how to dig further into this would be greatly
>> appreciated. If there is any other way for me to ignore these checksum
>> errors and let the job complete, please do share that with me as well.
>>
>> Here is the code for the reader job using MapReduce:
>>
>> package org.q.alt.sc.nutch.readerjobs;
>>
>> import java.io.IOException;
>>
>> import org.apache.hadoop.conf.Configured;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapred.FileInputFormat;
>> import org.apache.hadoop.mapred.FileOutputFormat;
>> import org.apache.hadoop.mapred.JobClient;
>> import org.apache.hadoop.mapred.JobConf;
>> import org.apache.hadoop.mapred.MapReduceBase;
>> import org.apache.hadoop.mapred.Mapper;
>> import org.apache.hadoop.mapred.OutputCollector;
>> import org.apache.hadoop.mapred.Reporter;
>> import org.apache.hadoop.mapred.SequenceFileInputFormat;
>> import org.apache.hadoop.mapred.TextOutputFormat;
>> import org.apache.hadoop.mapred.lib.IdentityReducer;
>> import org.apache.hadoop.util.Tool;
>> import org.apache.hadoop.util.ToolRunner;
>> import org.apache.nutch.protocol.Content;
>>
>> public class SegmentContentReader extends Configured implements Tool {
>>
>>     /**
>>      * @param args
>>      */
>>     public static void main(String[] args) throws Exception {
>>         int exitCode = ToolRunner.run(new SegmentContentReader(), args);
>>         System.exit(exitCode);
>>     }
>>
>>     @Override
>>     public int run(String[] args) throws Exception {
>>         if (args.length != 2) {
>>             System.out.printf(
>>                 "Usage: %s [generic options] <input dir> <output dir>\n",
>>                 getClass().getSimpleName());
>>             ToolRunner.printGenericCommandUsage(System.out);
>>             return -1;
>>         }
>>
>>         JobConf conf = new JobConf(getConf(), SegmentContentReader.class);
>>         conf.setBoolean("io.skip.checksum.errors", true);
>>         conf.setJobName(this.getClass().getName());
>>         conf.setJarByClass(SegmentContentReader.class);
>>
>>         FileInputFormat.addInputPath(conf, new Path(args[0]));
>>         conf.setInputFormat(SequenceFileInputFormat.class);
>>
>>         conf.setOutputFormat(TextOutputFormat.class);
>>         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>>
>>         conf.setMapperClass(Mapper1.class);
>>         conf.setMapOutputKeyClass(Text.class);
>>         conf.setMapOutputValueClass(Text.class);
>>
>>         conf.setReducerClass(IdentityReducer.class);
>>         conf.setOutputKeyClass(Text.class);
>>         conf.setOutputValueClass(Text.class);
>>
>>         JobClient.runJob(conf);
>>         return 0;
>>     }
>>
>>     public static class Mapper1 extends MapReduceBase implements
>>             Mapper<Text, Content, Text, Text> {
>>
>>         @Override
>>         public void map(Text key, Content value,
>>                 OutputCollector<Text, Text> output, Reporter reporter)
>>                 throws IOException {
>>             String content = new String(value.getContent());
>>             // System.out.println("Content: " + content);
>>             output.collect(key, new Text(content));
>>         }
>>     }
>> }
>>
>> Regards,
>> Safdar
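P.S. Regarding the failure-percentage settings mentioned at the top: I assume these correspond to JobConf.setMaxMapTaskFailuresPercent() / setMaxReduceTaskFailuresPercent() in the old mapred API. A minimal sketch of what I mean, assuming the 0.20.x-era API (the 20% values are just the ones I tried):

    import org.apache.hadoop.mapred.JobConf;

    public class FailureTolerance {
        // Apply the failure-tolerance knobs discussed above to a JobConf.
        public static JobConf apply(JobConf conf) {
            // Skip SequenceFile entries with checksum errors instead of throwing
            // (same flag already set in the reader job above).
            conf.setBoolean("io.skip.checksum.errors", true);
            // Let the job succeed even if up to 20% of map/reduce tasks fail.
            conf.setMaxMapTaskFailuresPercent(20);
            conf.setMaxReduceTaskFailuresPercent(20);
            return conf;
        }
    }

As far as I can tell, these percentages only come into play on a real cluster; the LocalJobRunner in the stack trace above doesn't retry failed tasks, which might be why the 20% setting made no difference for me.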