I just wanted to tell you that I found my error. With the former Nutch I never had documents deleted from the index after crawling, but this time that was exactly the problem. DeleteDuplicates needs an optimized index to work on; what I mean is that all deletions have to be flushed already, because only then is the number of documents in the index correct and no ArrayIndexOutOfBoundsException can occur. And the duplicates were indeed deleted as far as I can tell, because the index is much smaller now...
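In case anyone else hits this: before running DeleteDuplicates I now make sure the part indexes have their deletions flushed by optimizing them. Here is only a minimal sketch of what I mean, assuming a Lucene 2.x IndexWriter and a hypothetical part-index path under crawl1/indexes (adjust the path and analyzer for your own setup):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class OptimizeIndex {
    public static void main(String[] args) throws Exception {
        // Assumed location of one part index from a local crawl; adjust as needed.
        String indexPath = "crawl1/indexes/part-00000";

        // create=false opens the existing index instead of overwriting it.
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);

        // optimize() merges all segments and drops documents marked as deleted,
        // so the document count the index reports is consistent afterwards.
        writer.optimize();
        writer.close();
    }
}

Note that optimizing rewrites the index, so keep a copy around if you still need the unoptimized version.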
Tim Benke wrote:
> I guess the problem lies in the Configuration which I create with
> NutchConfiguration.create(), because Nutch uses the DeleteDuplicates
> class on the indexes anyway after finishing a crawl, right?
> What is really odd to me is that the number of documents reported by
> LUKE 0.7 and the number reported at the end of the Nutch-nightly crawl
> differ. I am referring to the number of documents merged at the end of
> each crawl. Does anybody have an idea what could cause this inconsistency?
>
> Tim Benke wrote:
>> Hello,
>>
>> I downloaded nutch-2007-03-27_06-52-06 and crawling works fine. I get
>> an error when trying to run DeleteDuplicates directly in Eclipse. The
>> corresponding "crawl1\index" opens fine in LUKE 0.7 and queries also
>> work. When trying to run it with args "crawl1\indexes", the output in
>> hadoop.log is:
>>
>> 2007-03-27 23:14:33,151 INFO indexer.DeleteDuplicates - Dedup: starting
>> 2007-03-27 23:14:33,198 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl1/indexes
>> 2007-03-27 23:14:33,792 WARN mapred.LocalJobRunner - job_uyjjzt
>> java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 550
>>     at org.apache.lucene.util.BitVector.get(BitVector.java:72)
>>     at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
>>     at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
>>     at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
>> 2007-03-27 23:14:34,495 FATAL indexer.DeleteDuplicates - DeleteDuplicates: java.io.IOException: Job failed!
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>>     at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>>     at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:506)
>>     at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>>     at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:490)
>>
>> Another thing I don't understand is that after crawling Nutch claims
>> 551 documents while LUKE states the index has only 473 documents.
>>
>> thanks in advance,
>>
>> Tim Benke
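PS: the mismatch in the quoted mail (Nutch reporting 551 documents while LUKE shows 473) also fits this picture, since an index with pending deletions has more document slots than live documents. A quick way to check, again only a sketch assuming a Lucene 2.x IndexReader and the same hypothetical part-index path as above:

import org.apache.lucene.index.IndexReader;

public class CountDocs {
    public static void main(String[] args) throws Exception {
        // Assumed location of one part index from a local crawl; adjust as needed.
        IndexReader reader = IndexReader.open("crawl1/indexes/part-00000");

        // maxDoc() counts document slots including deleted ones,
        // numDocs() counts only live documents.
        System.out.println("maxDoc  = " + reader.maxDoc());
        System.out.println("numDocs = " + reader.numDocs());

        reader.close();
    }
}

If maxDoc() is larger than numDocs(), the index still carries deleted documents; after optimizing, the two numbers match. I haven't checked which of the two counters Nutch and LUKE actually print, so take that mapping as an assumption.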
