Ok... So I just had Nutch do a moderate crawl, and it created a 20 GB link database.
How can I count the number of pages in this DB? I have limited resources, and the Hadoop cluster I'm running on is quite small at the moment. I'd like to index the whole web eventually, so I need to be able to calculate the average number of links per GB; that would give me an idea of how to distribute resources over the small clusters I intend to use for distributed searches. I've also found Nutch/Hadoop to work extremely well (Sept 29 nightly build).

Setup:
2x P3 800 MHz, 512 MB RAM
1x Celeron 1.4 GHz, 512 MB RAM
Total disk space: 750 GB

If this works well in my sandbox, I hope to deploy 3 clusters of P4s, with 6 machines per cluster and 10 TB of disk space.
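For what it's worth, a sketch of how the counting might be done with the stock Nutch command-line tools, assuming a recent nightly that includes readdb and readlinkdb (the paths crawl/crawldb, crawl/linkdb, and linkdump below are placeholders; substitute your own layout):

```shell
# Print crawl database statistics; the output includes a total URL count,
# which is the number of pages Nutch knows about.
bin/nutch readdb crawl/crawldb -stats

# To inspect the link database itself, dump it to plain text first.
bin/nutch readlinkdb crawl/linkdb -dump linkdump

# Each page with inlinks appears once in the dump as an "Inlinks:" record,
# so counting those records approximates the number of linked-to pages.
grep -c "Inlinks:" linkdump/part-*
```

The -stats route is cheapest since it's a single MapReduce pass and never materializes the dump on disk, which matters on a small cluster like this one.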
