Ok..

So I just had Nutch do a moderate crawl, and it created a 20 GB link database.

How can I count the number of pages in this DB?
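I'm assuming the answer involves the readdb / readlinkdb tools, something along these lines (the paths are just from my layout, and I'm not certain which options this particular nightly's readlinkdb supports beyond -dump):

  bin/nutch readdb crawl/crawldb -stats              # should print a "TOTAL urls" line for the crawl db
  bin/nutch readlinkdb crawl/linkdb -dump linkdump   # dumps the linkdb as text, which could then be counted

If there's a cheaper way to get the count without dumping all 20 GB, that's really what I'm after.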

I have limited resources, and the Hadoop cluster I'm running on is quite
small at the moment. I'd like to index the whole web eventually, so I need
to be able to work out the average number of links per GB; that would give me
an idea of how to distribute resources across the small clusters I intend to
use for distributed searching.
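To make the back-of-the-envelope math concrete (the page count below is purely made up until I can actually count the DB):

  10,000,000 pages in a 20 GB linkdb  ->  ~500,000 pages per GB of linkdb
  750 GB of disk x 500,000 pages/GB   ->  ~375,000,000 pages, before segments and indexes

Obviously the real ratio is whatever the count turns out to be; that's the number I want to plug in.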

I've also found Nutch/Hadoop to work extremely well (Sept 29 nightly
build)..

Setup:

2x P3 800 MHz, 512 MB RAM
1x Celeron 1.4 GHz, 512 MB RAM
Total disk space: 750 GB

If this works well in my sandbox, I hope to deploy 3 clusters of P4s with 6
machines per cluster and 10 TB of disk space.
