Thanks. I was hoping for something already written, but I'm afraid I'll have to follow your suggestion...

By the way, at least in my case (all pages fetched over HTTP), Luke shows that the "Number of documents" is exactly equal to the frequency of the term "http" in the "url" field, so this also kind of works:

bin/nutch org.apache.nutch.searcher.NutchBean url:http \
| sed -n -e 's/Total hits: //p'

Enzo

----- Original Message -----
From: "DES" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Saturday, July 28, 2007 5:43 PM
Subject: Re: How to determine the number of pages in the index?


Look at the org.apache.lucene.index.IndexReader.numDocs() method. You can
write a simple utility and run it from the shell.
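
A minimal sketch of such a utility (the index path argument, e.g. crawl/index, and the exact jar names are assumptions that depend on your Nutch installation; compile it against the lucene-core jar shipped with Nutch):

// CountDocs.java -- prints the number of live (non-deleted) documents in an index
import org.apache.lucene.index.IndexReader;

public class CountDocs {
    public static void main(String[] args) throws Exception {
        // args[0] is the index directory, e.g. crawl/index (assumption)
        IndexReader reader = IndexReader.open(args[0]);
        System.out.println(reader.numDocs());
        reader.close();
    }
}

Run it with something like: java -cp lucene-core-*.jar:. CountDocs crawl/index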

On 7/28/07, Enzo Michelangeli <[EMAIL PROTECTED]> wrote:
Is there a quick way of knowing how many pages are indexed (_not_ how many
are referenced in crawldb as fetched URLs)? I could use Luke to peek inside
the indexes and get the "Number of documents", but they are located on a
remote headless server with only SSH access... (OK, I actually did access
them using Sftpdrive, but I'd like to have a command line to invoke in a
shell script...)

Enzo



