Thanks. I was hoping for something already written, but I'm afraid I'll have
to follow your suggestion...
By the way, at least in my case (pages fetched only over HTTP), Luke shows
that the Number of documents is exactly equal to the frequency of the term
http in the url field, so this also kind of works:
bin/nutch org.apache.nutch.searcher.NutchBean url:http \
| sed -n -e 's/Total hits: //p'
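
For use in a shell script, the same pipeline could be wrapped roughly as
follows. This is only a sketch: the function names extract_total_hits and
indexed_pages and the NUTCH_HOME variable are made up for illustration, and
it assumes NutchBean prints a line containing "Total hits: " followed by the
count, which is what the sed expression above relies on.

```shell
#!/bin/sh

# extract_total_hits (hypothetical helper): reads NutchBean output on stdin
# and prints the number that follows "Total hits: ". Returns non-zero if no
# such line with a numeric count is present.
extract_total_hits() {
    hits=$(sed -n -e 's/.*Total hits: *\([0-9][0-9]*\).*/\1/p' | head -n 1)
    if [ -z "$hits" ]; then
        echo "no 'Total hits:' line found in input" >&2
        return 1
    fi
    printf '%s\n' "$hits"
}

# indexed_pages (hypothetical helper): runs the url:http query from the
# message and extracts the count. NUTCH_HOME is an assumed variable pointing
# at the Nutch installation; it defaults to the current directory.
indexed_pages() {
    "${NUTCH_HOME:-.}/bin/nutch" org.apache.nutch.searcher.NutchBean url:http \
        | extract_total_hits
}
```

After sourcing the file, a script would call indexed_pages and check its exit
status. As noted, the count only matches the number of documents when every
indexed page has http in its url field.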
Enzo
----- Original Message -----
From: DES [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, July 28, 2007 5:43 PM
Subject: Re: How to determine the number of pages in the index?
Look at the org.apache.lucene.index.IndexReader.numDocs() method. You can
write a simple utility to run it in the shell.
On 7/28/07, Enzo Michelangeli [EMAIL PROTECTED] wrote:
Is there a quick way of knowing how many pages are indexed (_not_ how many
are referenced in crawldb as fetched URLs)? I could use Luke to peek inside
the indexes and get the Number of documents, but they are located on a
remote headless server with only SSH access... (OK, I actually did access
them using Sftpdrive, but I'd like to have a command line to invoke in a
shell script...)
Enzo
___
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general