Hello,

I downloaded nutch-2007-03-27_06-52-06 and crawling works fine. However, I get an error when I run DeleteDuplicates directly in Eclipse with the argument "crawl1\\indexes". The corresponding "crawl1\\index" opens fine in LUKE 0.7, and queries against it also work. The output in hadoop.log is:

2007-03-27 23:14:33,151 INFO  indexer.DeleteDuplicates - Dedup: starting
2007-03-27 23:14:33,198 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl1/indexes
2007-03-27 23:14:33,792 WARN  mapred.LocalJobRunner - job_uyjjzt
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 550
  at org.apache.lucene.util.BitVector.get(BitVector.java:72)
  at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
  at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
  at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
2007-03-27 23:14:34,495 FATAL indexer.DeleteDuplicates - DeleteDuplicates: java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
  at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
  at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:506)
  at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
  at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:490)

Another thing I don't understand: after crawling, Nutch reports 551 documents, while LUKE says the index contains only 473 documents.

Thanks in advance,

Tim Benke
