Re: [Nutch-general] How to determine the number of pages in the index?

2007-07-28 Thread DES

Look at the org.apache.lucene.index.IndexReader.numDocs() method. You can
write a simple utility that calls it and run that utility from the shell.
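
A minimal sketch of such a utility, assuming the Lucene 2.x API that Nutch bundled at the time (IndexReader.open(String) was removed in later Lucene releases, where DirectoryReader.open on an FSDirectory takes its place). The class name CountDocs and the helper countDocs are made up for illustration; they are not part of Nutch or Lucene.

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;

/** Prints the number of documents in a Lucene index (hypothetical helper, not part of Nutch). */
public class CountDocs {

    /** Opens the index at indexDir, reads the document count, and closes the reader. */
    static int countDocs(String indexDir) throws IOException {
        // Assumption: Lucene 2.x, where IndexReader.open accepts a directory path.
        IndexReader reader = IndexReader.open(indexDir);
        try {
            // numDocs() counts only live documents; deleted ones are excluded.
            return reader.numDocs();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: CountDocs <index-directory>");
            System.exit(1);
        }
        System.out.println(countDocs(args[0]));
    }
}
```

If the class is compiled against the Lucene jar shipped with Nutch and placed on the classpath that bin/nutch builds, something like "bin/nutch CountDocs crawl/index" should print the count, where crawl/index is only a placeholder for the actual index directory.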

On 7/28/07, Enzo Michelangeli [EMAIL PROTECTED] wrote:
 Is there a quick way of knowing how many pages are indexed (_not_ how many
 are referenced in crawldb as fetched URL's)? I could use Luke to peek inside
 the indexes and get the Number of documents, but they are located on a
 remote headless server with only SSH access... (OK, I actually did access
 them using Sftpdrive, but I'd like to have a command line to invoke in a
 shell script...)

 Enzo



___
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general

Re: [Nutch-general] How to determine the number of pages in the index?

2007-07-28 Thread Enzo Michelangeli

Thanks. I was hoping for something already written, but I'm afraid I'll have 
to follow your suggestion...

By the way, at least in my case (pages only fetched with HTTP) Luke shows 
that the Number of documents is exactly equal to the frequency of the term 
http in the url field, so this also kind of works:

bin/nutch org.apache.nutch.searcher.NutchBean url:http \
| sed -n -e 's/Total hits: //p'

Enzo

