Exception in DeleteDuplicates in nutch-nightly

Tim Benke Tue, 27 Mar 2007 14:14:15 -0800

Hello,

I downloaded nutch-2007-03-27_06-52-06 and crawling works fine. I get an error when trying to run DeleteDuplicates directly in Eclipse. The corresponding "crawl1\\index" opens fine in LUKE 0.7 and queries also work. When I run DeleteDuplicates with args "crawl1\\indexes", the output in hadoop.log is:


2007-03-27 23:14:33,151 INFO  indexer.DeleteDuplicates - Dedup: starting
2007-03-27 23:14:33,198 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl1/indexes
2007-03-27 23:14:33,792 WARN  mapred.LocalJobRunner - job_uyjjzt
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 550
  at org.apache.lucene.util.BitVector.get(BitVector.java:72)
  at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
  at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
  at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
2007-03-27 23:14:34,495 FATAL indexer.DeleteDuplicates - DeleteDuplicates: java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
  at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
  at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:506)
  at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
  at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:490)

Another thing I don't understand is that after crawling, nutch claims 551 documents, while LUKE states the index has only 473 documents.


Thanks in advance,

Tim Benke
