TDLN wrote:
I think the nutch readdb command only gives statistics for the crawldb
(crawled Pages) and not the index.


That's correct. You can use the Lucene API to retrieve the number of
documents in the index; it's quite simple.

Something like this (you can use BSH to run it as a script, or compile
it as its own class). I'm writing this from memory, so there may be some
errors:

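import org.apache.lucene.index.IndexReader;

public class IndexStats {
   public static void main(String[] args) throws Exception {
      if (args.length != 1) {
         System.err.println("Usage: IndexStats <index directory>");
         System.exit(1);
      }
      IndexReader ir = IndexReader.open(args[0]);
      try {
         // numDocs() counts live documents only; deleted ones are excluded.
         System.out.println("Number of documents: " + ir.numDocs());
      } finally {
         ir.close();
      }
   }
}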
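
If documents have been deleted from the index, numDocs() will be smaller
than maxDoc(), because maxDoc() still counts deleted documents that have
not been merged away yet. Here is a rough sketch of a helper that reports
both. IndexStatsReport and its methods are hypothetical names, not part
of Nutch or Lucene; it takes plain ints so it compiles without Lucene on
the classpath, and you would pass it ir.numDocs() and ir.maxDoc():

```java
// Hypothetical helper (not part of Nutch or Lucene): formats index counters.
final class IndexStatsReport {

    // Deleted documents not yet merged away: maxDoc counts them, numDocs does not.
    static int deletedDocs(int numDocs, int maxDoc) {
        if (numDocs < 0 || maxDoc < numDocs) {
            throw new IllegalArgumentException(
                "expected 0 <= numDocs <= maxDoc, got " + numDocs + " and " + maxDoc);
        }
        return maxDoc - numDocs;
    }

    // Builds a one-line summary from the two counters.
    static String summary(int numDocs, int maxDoc) {
        return "Number of documents: " + numDocs
            + " (deleted: " + deletedDocs(numDocs, maxDoc) + ")";
    }

    // Usage: java IndexStatsReport <numDocs> <maxDoc>
    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: IndexStatsReport <numDocs> <maxDoc>");
            return;
        }
        System.out.println(
            summary(Integer.parseInt(args[0]), Integer.parseInt(args[1])));
    }
}
```

In the class above you would call it as
IndexStatsReport.summary(ir.numDocs(), ir.maxDoc()) before closing the
reader.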


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


