Hi Subbu! Thanks so much for this tip. Strangely, it doesn't seem to work for me ... I still get the checksum error (though it appears to happen later on in the job).
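For reference, clearing the .crc side files from the input folder can be done with something like the sketch below (plain java.io, since everything here runs in local mode; the segment path is just illustrative):

    import java.io.File;

    public class CrcCleaner {
        public static void main(String[] args) {
            // Illustrative local segment directory -- replace with the real input path.
            File dir = new File(args.length > 0 ? args[0] : "crawl/segments/20120509/content");
            deleteCrcFiles(dir);
        }

        // Recursively delete any Hadoop checksum side files (".<name>.crc") under dir.
        private static void deleteCrcFiles(File dir) {
            File[] entries = dir.listFiles();
            if (entries == null) {
                return; // not a directory, or unreadable
            }
            for (File f : entries) {
                if (f.isDirectory()) {
                    deleteCrcFiles(f);
                } else if (f.getName().endsWith(".crc")) {
                    System.out.println("Deleting " + f + ": " + f.delete());
                }
            }
        }
    }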
Has this workaround always worked for you? I also tried using the setMaxMapperFailurePercentage() and setMaxReducerFailurePercentage() settings (set them to 20% each; sketched at the end of this message), but I still see this checksum error. Any thoughts/suggestions?

Thanks again!

Regards,
Safdar

On Wed, May 9, 2012 at 12:37 PM, Kasi Subrahmanyam <kasisubbu...@gmail.com> wrote:

> Hi Ali,
> I also faced this error when I ran the jobs, either locally or on a cluster.
> I was able to solve the problem by removing the .crc file created in the
> input folder for this job.
> Please check that there is no .crc file in the input.
> I hope this solves the problem.
>
> Thanks,
> Subbu
>
>
> On Wed, May 9, 2012 at 1:31 PM, Ali Safdar Kureishy <safdar.kurei...@gmail.com> wrote:
>
>> Hi,
>>
>> I've included both the Nutch and Hadoop mailing lists, since I don't know
>> which of the two is the root cause of this issue, and it might be possible
>> to pursue a resolution from both sides.
>>
>> What I'm trying to do is dump the contents of all the fetched pages from
>> my Nutch crawl -- about 600K of them. I initially tried extracting this
>> information from the <segment>/parse_text folder, but I kept receiving the
>> error below, so I switched over to the <segment>/content folder. BOTH of
>> these *consistently* give me the following checksum exception, which fails
>> the map-reduce job. At the very least, I'm hoping to get some tips on how
>> to ignore this error and let my job complete.
>>
>> org.apache.hadoop.fs.ChecksumException: Checksum Error
>>   at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:164)
>>   at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
>>   at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
>>   at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
>>   at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
>>   at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:404)
>>   at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
>>   at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
>>   at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
>>   at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:156)
>>   at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:499)
>>   at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
>>   at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1522)
>>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>
>> I'm using SequenceFileInputFormat to read the data in each case.
>>
>> I have also attached the Hadoop output (checksum-error.txt). I have no
>> idea how to ignore this error or how to debug it. I've tried setting the
>> boolean "io.skip.checksum.errors" property to true on the MapReduce Conf
>> object, but it makes no difference. The error still happens consistently,
>> so it seems like I'm either not setting the right property, or it is being
>> ignored by Hadoop.
>> Since the error is thrown down in the internals of Hadoop, there doesn't
>> seem to be any other way to ignore it without changing Hadoop code (which
>> I'm not able to do at this point). Is this a problem with the data that
>> was output by Nutch? Or is this a bug in Hadoop? Btw, I ran Nutch in local
>> mode (without Hadoop), and I'm running the Hadoop job (below) purely as an
>> application from Eclipse (not via the bin/hadoop script).
>>
>> Any help or pointers on how to dig further into this would be greatly
>> appreciated. If there is any other way for me to ignore these checksum
>> errors and let the job complete, please do share that with me as well.
>>
>> Here is the code for the reader job using MapReduce:
>>
>> package org.q.alt.sc.nutch.readerjobs;
>>
>> import java.io.IOException;
>>
>> import org.apache.hadoop.conf.Configured;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapred.FileInputFormat;
>> import org.apache.hadoop.mapred.FileOutputFormat;
>> import org.apache.hadoop.mapred.JobClient;
>> import org.apache.hadoop.mapred.JobConf;
>> import org.apache.hadoop.mapred.MapReduceBase;
>> import org.apache.hadoop.mapred.Mapper;
>> import org.apache.hadoop.mapred.OutputCollector;
>> import org.apache.hadoop.mapred.Reporter;
>> import org.apache.hadoop.mapred.SequenceFileInputFormat;
>> import org.apache.hadoop.mapred.TextOutputFormat;
>> import org.apache.hadoop.mapred.lib.IdentityReducer;
>> import org.apache.hadoop.util.Tool;
>> import org.apache.hadoop.util.ToolRunner;
>> import org.apache.nutch.protocol.Content;
>>
>> public class SegmentContentReader extends Configured implements Tool {
>>
>>     /**
>>      * @param args
>>      */
>>     public static void main(String[] args) throws Exception {
>>         int exitCode = ToolRunner.run(new SegmentContentReader(), args);
>>         System.exit(exitCode);
>>     }
>>
>>     @Override
>>     public int run(String[] args) throws Exception {
>>         if (args.length != 2) {
>>             System.out.printf(
>>                 "Usage: %s [generic options] <input dir> <output dir>\n",
>>                 getClass().getSimpleName());
>>             ToolRunner.printGenericCommandUsage(System.out);
>>             return -1;
>>         }
>>
>>         JobConf conf = new JobConf(getConf(), SegmentContentReader.class);
>>         conf.setBoolean("io.skip.checksum.errors", true);
>>         conf.setJobName(this.getClass().getName());
>>         conf.setJarByClass(SegmentContentReader.class);
>>
>>         FileInputFormat.addInputPath(conf, new Path(args[0]));
>>         conf.setInputFormat(SequenceFileInputFormat.class);
>>
>>         conf.setOutputFormat(TextOutputFormat.class);
>>         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>>
>>         conf.setMapperClass(Mapper1.class);
>>         conf.setMapOutputKeyClass(Text.class);
>>         conf.setMapOutputValueClass(Text.class);
>>
>>         conf.setReducerClass(IdentityReducer.class);
>>         conf.setOutputKeyClass(Text.class);
>>         conf.setOutputValueClass(Text.class);
>>
>>         JobClient.runJob(conf);
>>         return 0;
>>     }
>>
>>     public static class Mapper1 extends MapReduceBase implements
>>             Mapper<Text, Content, Text, Text> {
>>
>>         @Override
>>         public void map(Text key, Content value,
>>                 OutputCollector<Text, Text> output, Reporter reporter)
>>                 throws IOException {
>>             String content = new String(value.getContent());
>>             // System.out.println("Content: " + content);
>>             output.collect(key, new Text(content));
>>         }
>>     }
>> }
>>
>> Regards,
>> Safdar
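P.S. Regarding the failure-percentage settings mentioned at the top: I assume these correspond to JobConf.setMaxMapTaskFailuresPercent() / setMaxReduceTaskFailuresPercent() in the old mapred API. A minimal sketch of what I mean, assuming the 0.20.x-era API (the 20% values are just the ones I tried):

    import org.apache.hadoop.mapred.JobConf;

    public class FailureTolerance {
        // Apply the failure-tolerance knobs discussed above to a JobConf.
        public static JobConf apply(JobConf conf) {
            // Skip SequenceFile entries with checksum errors instead of throwing
            // (same flag already set in the reader job above).
            conf.setBoolean("io.skip.checksum.errors", true);
            // Let the job succeed even if up to 20% of map/reduce tasks fail.
            conf.setMaxMapTaskFailuresPercent(20);
            conf.setMaxReduceTaskFailuresPercent(20);
            return conf;
        }
    }

As far as I can tell, these percentages only come into play on a real cluster; the LocalJobRunner in the stack trace above doesn't retry failed tasks, which might be why the 20% setting made no difference for me.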