Hello,

I downloaded nutch-2007-03-27_06-52-06 and crawling works fine. I get an 
error when trying to run DeleteDuplicates directly in Eclipse. The 
corresponding "crawl1\\index" opens fine in LUKE 0.7 and queries also 
work. When I run DeleteDuplicates with the argument "crawl1\\indexes", the
output in hadoop.log is:

2007-03-27 23:14:33,151 INFO  indexer.DeleteDuplicates - Dedup: starting
2007-03-27 23:14:33,198 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl1/indexes
2007-03-27 23:14:33,792 WARN  mapred.LocalJobRunner - job_uyjjzt
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 550
    at org.apache.lucene.util.BitVector.get(BitVector.java:72)
    at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
    at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
    at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
2007-03-27 23:14:34,495 FATAL indexer.DeleteDuplicates - DeleteDuplicates: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:506)
    at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
    at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:490)

Another thing I don't understand is that, after crawling, Nutch claims 551 
documents while LUKE states the index has only 473 documents.

Thanks in advance,

Tim Benke

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
