John Mendenhall wrote:

What is the host distribution of your fetchlist? I.e. how many unique hosts do you have among all the URLs in the fetchlist? If it's just one (or a few), it could happen that they are all mapped to a single map task. This is done on purpose - there is no central lock manager in Nutch / Hadoop, and Nutch needs a way to control the rate of access to any single host, for politeness reasons. Nutch can do this only if all URLs from the same host are assigned to the same map task.

All hosts are the same. Every one of them.

If there is no way to split them up, this seems to imply that the distributed nature of Nutch is lost when attempting to build an index for a single large site. Please correct me if I am wrong in this presumption.

It doesn't matter whether you use a distributed crawl or not - you are still expected to crawl politely, meaning that you should not exceed a certain rate of requests per second to any given host. Since all your URLs come from the same host, then no matter how many machines you throw at it, you will still be crawling at a rate of 1 page / 5 seconds (or whatever you set in nutch-site.xml). So, a single machine can manage this just fine.
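
For reference, the per-host delay comes from the fetcher settings in nutch-site.xml; a minimal override might look like the snippet below (property name and default taken from nutch-default.xml - check the copy shipped with your version):

  <property>
    <name>fetcher.server.delay</name>
    <!-- seconds to wait between successive requests to the same host -->
    <value>5.0</value>
  </property>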
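
To make the "same host, same map task" point from the quoted paragraph more concrete: conceptually the generate/partition step routes URLs to tasks by host, roughly like the hypothetical Hadoop partitioner sketched below. This is a simplified illustration, not the actual Nutch class - the class and method names here are mine.

  import java.net.MalformedURLException;
  import java.net.URL;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Hypothetical sketch: every URL with the same host lands in the same
  // partition, so one task owns all of a host's URLs and can enforce the
  // per-host politeness delay without any central lock manager.
  public class HostPartitioner implements Partitioner<Text, Writable> {

    public void configure(JobConf job) {
      // nothing to configure in this sketch
    }

    public int getPartition(Text url, Writable value, int numPartitions) {
      String host;
      try {
        host = new URL(url.toString()).getHost().toLowerCase();
      } catch (MalformedURLException e) {
        host = url.toString(); // fall back to the raw key
      }
      return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

With a single host, every URL hashes to the same partition, which is exactly why adding more machines does not speed up a polite crawl of one site.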


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
