This may be of use:
#bin/nutch readdb crawldir/crawldb -stats
It will give you the number of links in your webdb (approximately 2M links for 18GB, in my case) ;-)
Regards,
Ronny
Webmaster wrote:
Ok..
So I just had Nutch do a moderate crawl, and it created a 20GB link database.
How can I count the number of pages in this DB?
I have limited resources, and the Hadoop cluster I'm running on is quite
small at the moment. I'd like to index the whole web eventually, so I need
to be able to calculate the average number of links per GB; that will give me
an idea of how to distribute resources over the small clusters I intend to
use for distributed searches.
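As a rough sketch of that calculation (purely illustrative arithmetic, not something Nutch reports itself), the figure quoted elsewhere in this thread of ~2M links per 18GB of webdb can be extrapolated to the 20GB database here and to a 10TB deployment:

```python
# Capacity estimate from the ~2M links / 18GB observation in this thread.
# All sizes are the figures quoted in the emails; the extrapolation assumes
# link density stays roughly constant as the db grows, which is a guess.

OBSERVED_LINKS = 2_000_000    # ~2M links
OBSERVED_SIZE_GB = 18         # in an 18GB webdb

links_per_gb = OBSERVED_LINKS / OBSERVED_SIZE_GB   # ~111k links per GB

# Estimate for the 20GB crawl database described above:
est_20gb = links_per_gb * 20

# Estimate for a 10TB (10 * 1024 GB) deployment:
est_10tb = links_per_gb * 10 * 1024

print(f"links/GB:  {links_per_gb:,.0f}")
print(f"20GB db:   {est_20gb:,.0f} links")
print(f"10TB farm: {est_10tb:,.0f} links")
```

Whether the links-per-GB ratio actually holds at larger scales depends on page sizes and outlink counts in the crawled segment, so treat these as order-of-magnitude numbers only.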
I've also found Nutch/Hadoop to work extremely well (Sept 29 nightly
build).
Setup:
2x P3 800MHz, 512MB RAM
1x Celeron 1.4GHz, 512MB RAM
Total disk space 750GB
If this works well in my sandbox, I hope to deploy 3 clusters of P4s with 6
machines/cluster and 10TB of disk space.
--
mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
-> Ronald Muwonge
-> 'The M'
-> Africa's Search
-> www.mputa.com
-> www.africa.mputa.com
mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm