I just wanted to let you know that I found my error. With the previous
Nutch version it never happened to me that documents were deleted after
crawling, but this time that was exactly the problem.
DeleteDuplicates needs an optimized index to work on; what I mean is
that all the deletions have to be flushed already, because then the
number of documents in the index is correct and no
ArrayIndexOutOfBoundsException can occur.
And as far as I can tell the duplicates were indeed deleted, because
the index is much smaller now...
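
For anyone running into the same thing, here is roughly what "flushing
the deletions" looks like in code. This is only a sketch against the
Lucene API bundled with the Nutch nightly, and the path is just the one
from my example below, so adjust as needed:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class OptimizeBeforeDedup {
    public static void main(String[] args) throws Exception {
        // Open the existing index read-write; create = false keeps its contents.
        IndexWriter writer = new IndexWriter("crawl1/index", new StandardAnalyzer(), false);
        // optimize() merges segments and physically removes documents that were
        // marked as deleted, so the document count DeleteDuplicates sees is
        // consistent again.
        writer.optimize();
        writer.close();
    }
}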
Tim Benke wrote:
I guess the problem lies in the Configuration which I create with
NutchConfiguration.create(), because Nutch runs the DeleteDuplicates
class on the indexes anyway after finishing a crawl, right?
What is really odd to me is that the number of documents reported by
LUKE 0.7 differs from the number reported at the end of the crawl with
the Nutch nightly. I am referring to the number of documents merged at
the end of each crawl. Does anybody have an idea what could cause this
inconsistency?
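
In case it matters for comparison, the simplest way I could drive it is
through its main() method, which is the same entry point the command
line uses; the remark about where it gets its Configuration from is my
assumption based on the ToolBase.doMain line in the stack trace below,
not something I verified:

import org.apache.nutch.indexer.DeleteDuplicates;

public class RunDedupFromEclipse {
    public static void main(String[] args) throws Exception {
        // Equivalent to "bin/nutch dedup crawl1/indexes". This goes through
        // ToolBase.doMain(), which I assume obtains its Configuration via
        // NutchConfiguration.create(), so no hand-built Configuration is
        // needed when invoking the tool this way.
        DeleteDuplicates.main(new String[] { "crawl1/indexes" });
    }
}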
Tim Benke wrote:
Hello,
I downloaded nutch-2007-03-27_06-52-06 and crawling works fine. I get
an error when trying to run DeleteDuplicates directly in Eclipse. The
corresponding "crawl1\\index" opens fine in LUKE 0.7 and queries also
work. When I try to run DeleteDuplicates with the argument
"crawl1\\indexes", the output in hadoop.log is:
2007-03-27 23:14:33,151 INFO indexer.DeleteDuplicates - Dedup: starting
2007-03-27 23:14:33,198 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl1/indexes
2007-03-27 23:14:33,792 WARN mapred.LocalJobRunner - job_uyjjzt
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 550
    at org.apache.lucene.util.BitVector.get(BitVector.java:72)
    at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
    at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
    at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
2007-03-27 23:14:34,495 FATAL indexer.DeleteDuplicates - DeleteDuplicates: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:506)
    at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
    at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:490)
Another thing I don't understand is that after crawling Nutch claims
551 documents, while LUKE states the index has only 473 documents.
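
If it helps to narrow this down, one way to compare the two counts is
the sketch below (written against the Lucene API that ships with Nutch;
that the gap of 78 documents consists of deleted documents is only my
guess):

import org.apache.lucene.index.IndexReader;

public class CompareDocCounts {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("crawl1/index");
        // maxDoc() still counts documents that are marked as deleted;
        // numDocs() counts only live documents, which I believe is the
        // figure LUKE displays.
        System.out.println("maxDoc  = " + reader.maxDoc());
        System.out.println("numDocs = " + reader.numDocs());
        System.out.println("hasDeletions = " + reader.hasDeletions());
        reader.close();
    }
}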
thanks in advance,
Tim Benke