Hello,
I downloaded nutch-2007-03-27_06-52-06 and crawling works fine, but I get an
error when I try to run DeleteDuplicates directly in Eclipse. The
corresponding "crawl1\\index" opens fine in LUKE 0.7 and queries also
work. When I run DeleteDuplicates with the argument "crawl1\\indexes", the
output in hadoop.log is:
2007-03-27 23:14:33,151 INFO indexer.DeleteDuplicates - Dedup: starting
2007-03-27 23:14:33,198 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl1/indexes
2007-03-27 23:14:33,792 WARN mapred.LocalJobRunner - job_uyjjzt
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 550
    at org.apache.lucene.util.BitVector.get(BitVector.java:72)
    at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
    at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
    at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
2007-03-27 23:14:34,495 FATAL indexer.DeleteDuplicates - DeleteDuplicates: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:506)
    at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
    at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:490)
Another thing I don't understand is that after crawling, Nutch reports 551
documents, while LUKE states the index has only 473 documents.
Thanks in advance,
Tim Benke
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general