edwinchiu wrote:
I've seen quite a few postings with the same crash callstack as this, but couldn't find any reply on the issue. Basically, a simple crawl (on a single machine) of some sites crashes even with a topN and depth of 1. Try http://www.peperonity.com/ for example.

stdout/stderr:

    Indexer: linkdb: crawlPep/linkdb
    Indexer: adding segment: crawlPep/segments/20080424194340
    Optimizing index.
    Indexer: done
    Dedup: starting
    Dedup: adding indexes in: crawlPep/indexes
    Exception in thread "main" java.io.IOException: Job failed!
            at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
            at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
            at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

In the log file:

    2008-04-24 19:43:54,877 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawlPep/indexes
    2008-04-24 19:43:55,941 WARN  mapred.LocalJobRunner - job_fn38lq
    java.lang.ArrayIndexOutOfBoundsException: -1
            at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
            at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
            at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
            at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
            at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)

It seems like a pretty common issue. Can someone shed some light on it before we resort to heavier debugging and printlns? Thanks much.
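For what it's worth, the ArrayIndexOutOfBoundsException itself comes from Lucene's MultiReader.isDeleted() being asked about a document number it cannot map to any sub-index; the dedup record reader ends up handing it such a number. Here is a minimal sketch that reproduces the same exception against the Lucene 2.x API Nutch shipped with at the time; the part-00000 path is only an assumption about how the crawlPep/indexes directory is laid out, and the exact cause inside DeleteDuplicates may differ (e.g. an empty index part):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;

    public class IsDeletedRepro {
        public static void main(String[] args) throws Exception {
            // Open one of the index parts produced by the indexing step.
            // "crawlPep/indexes/part-00000" is an assumed path; adjust to your layout.
            IndexReader part = IndexReader.open("crawlPep/indexes/part-00000");
            MultiReader reader = new MultiReader(new IndexReader[] { part });

            // Asking about a document number the reader cannot map to a sub-index
            // throws the same java.lang.ArrayIndexOutOfBoundsException: -1
            // seen in the LocalJobRunner log above.
            System.out.println(reader.isDeleted(-1));
        }
    }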
This patch may be helpful: http://issues.apache.org/jira/browse/NUTCH-525 :)
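If you can't apply the patch right away, one defensive workaround (independent of whatever the patch actually changes, which I haven't checked) is to bounds-check the document number before calling isDeleted(). The sketch below is illustrative only and is not the NUTCH-525 patch; isLiveDoc is a made-up helper name:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;

    class SafeDocCheck {
        // Returns true only for document numbers the reader can actually answer for.
        static boolean isLiveDoc(IndexReader reader, int doc) throws IOException {
            if (doc < 0 || doc >= reader.maxDoc()) {
                // Out-of-range ids are what trigger the ArrayIndexOutOfBoundsException.
                return false;
            }
            return !reader.isDeleted(doc);
        }
    }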
