Hi Ali,

I also faced this error when I ran jobs both locally and on a cluster. I was able to solve the problem by removing the .crc files created in the input folder for the job. Please check that there are no .crc files in the input. I hope this solves the problem.
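In case it helps, here is a rough, untested sketch of what I mean. It assumes the input directory is on the local filesystem; the class name and argument handling are just for illustration, not anything from Nutch or Hadoop:

import java.io.File;

// Untested sketch: remove any leftover .crc files from a local input folder
// before running the job. Pass the input directory as the first argument.
public class RemoveCrcFiles {
    public static void main(String[] args) {
        File inputDir = new File(args[0]);
        File[] files = inputDir.listFiles();   // includes hidden files such as .part-00000.crc
        if (files == null) {
            System.err.println("Not a directory: " + inputDir);
            return;
        }
        for (File f : files) {
            if (f.getName().endsWith(".crc")) {
                System.out.println("Deleting " + f + ": " + f.delete());
            }
        }
    }
}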
Thanks,
Subbu

On Wed, May 9, 2012 at 1:31 PM, Ali Safdar Kureishy <safdar.kurei...@gmail.com> wrote:

> Hi,
>
> I've included both the Nutch and Hadoop mailing lists, since I don't know
> which of the two is the root cause of this issue, and it might be possible
> to pursue a resolution from both sides.
>
> What I'm trying to do is dump the contents of all the fetched pages from my
> Nutch crawl -- about 600K of them. I initially tried extracting this
> information from the *<segment>/parse_text* folder, but I kept receiving
> the error below, so I switched over to the *<segment>/content* folder.
> BOTH of these *consistently* give me the following ChecksumException, which
> fails the map-reduce job. At the very least I'm hoping to get some tip(s)
> on how to ignore this error and let my job complete.
>
> org.apache.hadoop.fs.ChecksumException: Checksum Error
>     at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:164)
>     at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
>     at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
>     at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
>     at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
>     at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:404)
>     at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
>     at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
>     at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
>     at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:156)
>     at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:499)
>     at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
>     at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1522)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>
> I'm using the *SequenceFileInputFormat* to read the data in each case.
>
> I have also attached the Hadoop output (checksum-error.txt). I have no idea
> how to ignore this error or how to debug it. I've tried setting the boolean
> "*io.skip.checksum.errors*" property to *true* on the MapReduce Conf
> object, but it makes no difference. The error still happens consistently,
> so it seems like I'm either not setting the right property, or it is being
> ignored by Hadoop. Since the error is thrown down in the internals of
> Hadoop, there doesn't seem to be any other way to ignore it without
> changing Hadoop code (which I'm not able to do at this point). Is this a
> problem with the data that was output by Nutch, or is this a bug in Hadoop?
> *Btw, I ran Nutch in local mode (without Hadoop), and I'm running the
> Hadoop job (below) purely as an application from Eclipse (not via the
> bin/hadoop script).*
>
> Any help or pointers on how to dig further with this would be greatly
> appreciated. If there is any other way for me to ignore these checksum
> errors and let the job complete, do please share that with me as well.
>
> Here is the code for the reader job using MapReduce:
>
> package org.q.alt.sc.nutch.readerjobs;
>
> import java.io.IOException;
>
> import org.apache.hadoop.conf.Configured;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.FileInputFormat;
> import org.apache.hadoop.mapred.FileOutputFormat;
> import org.apache.hadoop.mapred.JobClient;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.MapReduceBase;
> import org.apache.hadoop.mapred.Mapper;
> import org.apache.hadoop.mapred.OutputCollector;
> import org.apache.hadoop.mapred.Reporter;
> import org.apache.hadoop.mapred.SequenceFileInputFormat;
> import org.apache.hadoop.mapred.TextOutputFormat;
> import org.apache.hadoop.mapred.lib.IdentityReducer;
> import org.apache.hadoop.util.Tool;
> import org.apache.hadoop.util.ToolRunner;
> import org.apache.nutch.protocol.Content;
>
> public class SegmentContentReader extends Configured implements Tool {
>
>     /**
>      * @param args
>      */
>     public static void main(String[] args) throws Exception {
>         int exitCode = ToolRunner.run(new SegmentContentReader(), args);
>         System.exit(exitCode);
>     }
>
>     @Override
>     public int run(String[] args) throws Exception {
>         if (args.length != 2) {
>             System.out.printf(
>                 "Usage: %s [generic options] <input dir> <output dir>\n",
>                 getClass().getSimpleName());
>             ToolRunner.printGenericCommandUsage(System.out);
>             return -1;
>         }
>
>         JobConf conf = new JobConf(getConf(), SegmentContentReader.class);
>         conf.setBoolean("io.skip.checksum.errors", true);
>         conf.setJobName(this.getClass().getName());
>         conf.setJarByClass(SegmentContentReader.class);
>
>         FileInputFormat.addInputPath(conf, new Path(args[0]));
>         conf.setInputFormat(SequenceFileInputFormat.class);
>
>         conf.setOutputFormat(TextOutputFormat.class);
>         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>
>         conf.setMapperClass(Mapper1.class);
>         conf.setMapOutputKeyClass(Text.class);
>         conf.setMapOutputValueClass(Text.class);
>
>         conf.setReducerClass(IdentityReducer.class);
>         conf.setOutputKeyClass(Text.class);
>         conf.setOutputValueClass(Text.class);
>
>         JobClient.runJob(conf);
>         return 0;
>     }
>
>     public static class Mapper1 extends MapReduceBase implements
>             Mapper<Text, Content, Text, Text> {
>
>         @Override
>         public void map(Text key, Content value,
>                 OutputCollector<Text, Text> output, Reporter reporter)
>                 throws IOException {
>             String content = new String(value.getContent());
>             //System.out.println("Content: " + content);
>             output.collect(key, new Text(content));
>         }
>     }
> }
>
> Regards,
> Safdar
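P.S. Since you asked about other ways to get past the checksum errors: the trace shows the failure happening while the map outputs are being merged, so one thing you could try is to skip MapReduce altogether and read the segment's sequence files directly with checksum verification turned off on the local filesystem. This is only an untested sketch -- it assumes the segment is on the local filesystem, that <segment>/content/part-00000/data is the file you want to read (adjust for your layout), and the class name is just for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

// Untested sketch: dump <segment>/content records without a MapReduce job,
// with checksum verification disabled so a bad .crc file cannot fail the read.
public class DirectContentDumper {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        LocalFileSystem fs = FileSystem.getLocal(conf);
        fs.setVerifyChecksum(false);                  // do not verify against .crc files
        Path data = new Path(args[0]);                // e.g. <segment>/content/part-00000/data
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        try {
            Text key = new Text();
            Content value = new Content();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + new String(value.getContent()));
            }
        } finally {
            reader.close();
        }
    }
}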