There is a command that shows statistics for your database of links (the 
crawldb). It reports how many URLs have been fetched (if any) and how many 
are still waiting to be fetched.
 
Keep in mind, though, that a page that cannot be retrieved during a fetch 
will not be indexed, so treat these numbers only as an estimate of the final 
indexed count.

 
The command is below. It can take minutes or even hours to complete, 
depending on the size of your database.
 
"bin/nutch readdb [path to crawldb] -stats"
 
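If you run this often, a small wrapper can save some typing. This is only a 
rough sketch: the function name crawldb_stats, the NUTCH variable and the 
crawl/crawldb path are placeholders I made up, not anything Nutch provides. 
It assumes you call it from the Nutch install directory.

```shell
# Location of the nutch launcher script; override NUTCH if yours lives elsewhere.
NUTCH="${NUTCH:-bin/nutch}"

# crawldb_stats <path to crawldb>
# Runs "readdb ... -stats" against the given crawldb and prints the result.
crawldb_stats() {
    "$NUTCH" readdb "$1" -stats
}
```

You would then call it as "crawldb_stats crawl/crawldb", replacing 
crawl/crawldb with the path to your own crawldb. Since the run can take a 
long time on a large database, it may be worth redirecting the output to a 
file so you can refer to it later.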
 
----- Original Message ----
From: bbrown <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Wednesday, May 16, 2007 4:42:05 PM
Subject: Generic Question about initial seed


This is kind of a generic question. Are there any stats on how many pages 
will get crawled based on some initial seed?  For example, if you seed the 
list from dmoz, how many pages will get indexed?  Let's say there are 4 
million; will only 4 million get indexed?

Or let's say I have 4000; will I get 30,000 crawled/indexed pages?

--
Berlin Brown
[berlin dot brown at gmail dot com]
http://botspiritcompany.com/botlist/?