edwinchiu wrote:
I've seen quite a few postings with the same crash callstack as this, but couldn't find any reply on the issue. Basically, a simple crawl (on a single machine) of some sites crashes even with a topN and depth of 1. Try http://www.peperonity.com/ for example.

stdout/stderr:

    Indexer: linkdb: crawlPep/linkdb
    Indexer: adding segment: crawlPep/segments/20080424194340
    Optimizing index.
    Indexer: done
    Dedup: starting
    Dedup: adding indexes in: crawlPep/indexes
    Exception in thread "main" java.io.IOException: Job failed!
            at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
            at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
            at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

In the log file:

    2008-04-24 19:43:54,877 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawlPep/indexes
    2008-04-24 19:43:55,941 WARN  mapred.LocalJobRunner - job_fn38lq
    java.lang.ArrayIndexOutOfBoundsException: -1
            at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
            at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
            at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
            at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
            at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)

It seems like a pretty common issue. Can someone shed some light on it before we resort to heavier debugging and printlns? Thanks much.
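For what it's worth, the ArrayIndexOutOfBoundsException itself comes from Lucene's MultiReader.isDeleted() being asked about a document number it cannot map to any sub-index; the dedup record reader ends up handing it such a number. Here is a minimal sketch that reproduces the same exception against the Lucene 2.x API Nutch shipped with at the time; the part-00000 path is only an assumption about how the crawlPep/indexes directory is laid out, and the exact cause inside DeleteDuplicates may differ (e.g. an empty index part):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;

    public class IsDeletedRepro {
        public static void main(String[] args) throws Exception {
            // Open one of the index parts produced by the indexing step.
            // "crawlPep/indexes/part-00000" is an assumed path; adjust to your layout.
            IndexReader part = IndexReader.open("crawlPep/indexes/part-00000");
            MultiReader reader = new MultiReader(new IndexReader[] { part });

            // Asking about a document number the reader cannot map to a sub-index
            // throws the same java.lang.ArrayIndexOutOfBoundsException: -1
            // seen in the LocalJobRunner log above.
            System.out.println(reader.isDeleted(-1));
        }
    }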
This patch may be helpful: http://issues.apache.org/jira/browse/NUTCH-525 :)
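If you can't apply the patch right away, one defensive workaround (independent of whatever the patch actually changes, which I haven't checked) is to bounds-check the document number before calling isDeleted(). The sketch below is illustrative only and is not the NUTCH-525 patch; isLiveDoc is a made-up helper name:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;

    class SafeDocCheck {
        // Returns true only for document numbers the reader can actually answer for.
        static boolean isLiveDoc(IndexReader reader, int doc) throws IOException {
            if (doc < 0 || doc >= reader.maxDoc()) {
                // Out-of-range ids are what trigger the ArrayIndexOutOfBoundsException.
                return false;
            }
            return !reader.isDeleted(doc);
        }
    }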
