I've seen quite a few postings with the same crash call stack as this one, but
I couldn't find any reply on the issue. Basically, a simple crawl (on a single
machine) of some sites crashes even with a topN and depth of 1.
Try http://www.peperonity.com/ for example.
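For reference, this is just the stock one-shot crawl command from the tutorial;
the seed directory name and the exact numbers here are only illustrative:

bin/nutch crawl urls -dir crawlPep -depth 1 -topN 1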
Stdout/stderr:
Indexer: linkdb: crawlPep/linkdb
Indexer: adding segment: crawlPep/segments/20080424194340
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawlPep/indexes
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
In the log file:
2008-04-24 19:43:54,877 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawlPep/indexes
2008-04-24 19:43:55,941 WARN mapred.LocalJobRunner - job_fn38lq
java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
    at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
    at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
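For what it's worth, the trace shows MultiReader.isDeleted() blowing up with an
index of -1 from inside DDRecordReader.next(). Before sprinkling printlns
through Nutch itself, a small standalone check along these lines might help;
the class name is made up, the part index paths are passed on the command
line, and it assumes the pre-3.0 Lucene IndexReader API that ships with this
Nutch release:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;

// Hypothetical helper, not part of Nutch: pass the part-NNNNN directories
// under crawlPep/indexes as command-line arguments.
public class DedupIndexCheck {
    public static void main(String[] args) throws Exception {
        // Open each part index and print its doc counts; an empty or
        // corrupt part index would stand out here.
        IndexReader[] parts = new IndexReader[args.length];
        for (int i = 0; i < args.length; i++) {
            parts[i] = IndexReader.open(args[i]);
            System.out.println(args[i] + ": maxDoc=" + parts[i].maxDoc()
                    + " numDocs=" + parts[i].numDocs());
        }

        // Wrap them in a MultiReader, which is what the stack trace shows
        // the dedup record reader going through, and walk every valid doc
        // number. A doc number below 0 is what would map to sub-reader
        // index -1 and the ArrayIndexOutOfBoundsException in the log.
        IndexReader multi = new MultiReader(parts);
        for (int doc = 0; doc < multi.maxDoc(); doc++) {
            multi.isDeleted(doc);
        }
        System.out.println("walked " + multi.maxDoc()
                + " docs through isDeleted() without an exception");
        multi.close();
    }
}

If the part indexes all open and report sane doc counts, the -1 is presumably
being computed inside DeleteDuplicates rather than coming from a broken index.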
It seems like a pretty common issue. Can someone shed some light on it before
we do some heavier debugging with printlns?
Thanks much.