This may be of use:
#bin/nutch readdb crawldir/crawldb -stats
It will give you the number of links in your webdb (approximately 2M links for 18GB, in my case) ;-)
Regards,
Ronny
Webmaster wrote:
Ok..
So I just had Nutch do a moderate crawl, and it created a 20GB link database.
How can I count the number of pages in this DB?
I have limited resources, and the Hadoop cluster I'm running on is quite
small at the moment. I'd like to index the whole web eventually, so I need
to be able to calculate the average number of links per GB; that will give me
an idea of how to distribute resources over the small clusters I intend to
use for distributed searches.
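As a rough sketch of that calculation (purely illustrative arithmetic, not something Nutch reports itself), the figure quoted elsewhere in this thread of ~2M links per 18GB of webdb can be extrapolated to the 20GB database here and to a 10TB deployment:

```python
# Capacity estimate from the ~2M links / 18GB observation in this thread.
# All sizes are the figures quoted in the emails; the extrapolation assumes
# link density stays roughly constant as the db grows, which is a guess.

OBSERVED_LINKS = 2_000_000    # ~2M links
OBSERVED_SIZE_GB = 18         # in an 18GB webdb

links_per_gb = OBSERVED_LINKS / OBSERVED_SIZE_GB   # ~111k links per GB

# Estimate for the 20GB crawl database described above:
est_20gb = links_per_gb * 20

# Estimate for a 10TB (10 * 1024 GB) deployment:
est_10tb = links_per_gb * 10 * 1024

print(f"links/GB:  {links_per_gb:,.0f}")
print(f"20GB db:   {est_20gb:,.0f} links")
print(f"10TB farm: {est_10tb:,.0f} links")
```

Whether the links-per-GB ratio actually holds at larger scales depends on page sizes and outlink counts in the crawled segment, so treat these as order-of-magnitude numbers only.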
I've also found Nutch/Hadoop to work extremely well (Sept 29 nightly
build).
Setup:
2x P3 800MHz, 512MB RAM
1x Celeron 1.4GHz, 512MB RAM
Total disk space 750GB
If this works well in my sandbox, I hope to deploy 3 clusters of P4s with 6
machines/cluster and 10TB of disk space.
--
mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
-> Ronald Muwonge
-> 'The M'
-> Africa's Search
-> www.mputa.com
-> www.africa.mputa.com
mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm