Thanks. I was hoping for something already written, but I'm afraid I'll have to follow your suggestion...

By the way, at least in my case (all pages fetched over HTTP), Luke shows that the "Number of documents" is exactly equal to the frequency of the term "http" in the "url" field, so this also kind of works:

bin/nutch org.apache.nutch.searcher.NutchBean url:http \
| sed -n -e 's/Total hits: //p'

Enzo

----- Original Message -----
From: "DES" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Saturday, July 28, 2007 5:43 PM
Subject: Re: How to determine the number of pages in the index?


Look at the org.apache.lucene.index.IndexReader.numDocs() method. You can
write a simple utility and run it from the shell.
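
A minimal sketch of such a utility (the index path argument, e.g. crawl/index, and the exact jar names are assumptions that depend on your Nutch installation; compile it against the lucene-core jar shipped with Nutch):

// CountDocs.java -- prints the number of live (non-deleted) documents in an index
import org.apache.lucene.index.IndexReader;

public class CountDocs {
    public static void main(String[] args) throws Exception {
        // args[0] is the index directory, e.g. crawl/index (assumption)
        IndexReader reader = IndexReader.open(args[0]);
        System.out.println(reader.numDocs());
        reader.close();
    }
}

Run it with something like: java -cp lucene-core-*.jar:. CountDocs crawl/index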

On 7/28/07, Enzo Michelangeli <[EMAIL PROTECTED]> wrote:
Is there a quick way of knowing how many pages are indexed (_not_ how many
are referenced in crawldb as fetched URLs)? I could use Luke to peek inside
the indexes and get the "Number of documents", but they are located on a
remote headless server with only SSH access... (OK, I actually did access
them using Sftpdrive, but I'd like to have a command line to invoke in a
shell script...)

Enzo



