John Mendenhall wrote:
>> What is the host distribution of your fetchlist? I.e., how many unique
>> hosts are there among all the URLs in the fetchlist? If it's just one
>> (or a few), it can happen that they are all mapped to a single map task.
>> This is done on purpose - there is no central lock manager in Nutch /
>> Hadoop, and Nutch needs a way to control the rate of access to any
>> single host, for politeness reasons. Nutch can do this only if all URLs
>> from the same host are assigned to the same map task.
>
> All hosts are the same. Every one of them.
>
> If there is no way to split them up, this seems to imply that the
> distributed nature of Nutch is lost when building an index for a single
> large site. Please correct me if this presumption is wrong.
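
Just to illustrate the per-host mapping mentioned above: conceptually it
works like a Hadoop partitioner keyed on the URL's host, roughly as in the
sketch below (class and method names here are illustrative, not the actual
Nutch source):

import java.net.MalformedURLException;
import java.net.URL;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sketch: send every URL from the same host to the same partition,
// so that a single task "owns" that host and can throttle requests to it.
public class HostPartitioner implements Partitioner<Text, Writable> {

  public void configure(JobConf job) {
    // no configuration needed for this sketch
  }

  public int getPartition(Text urlKey, Writable value, int numTasks) {
    String host;
    try {
      host = new URL(urlKey.toString()).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      host = urlKey.toString();   // fall back to the raw key
    }
    // Same host -> same hash -> same partition, no matter how many
    // tasks or machines take part in the fetch.
    return (host.hashCode() & Integer.MAX_VALUE) % numTasks;
  }
}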
It doesn't matter whether you use a distributed crawl or not - you are
still expected to crawl politely, meaning that you should not exceed a
certain rate of requests per second to any given host. Since all your
URLs come from the same host, no matter how many machines you throw at
it, you will still be crawling at a rate of 1 page / 5 seconds (or
whatever you set in nutch-site.xml). So a single machine can manage this
just fine.
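
For reference, that per-host delay is what you set in nutch-site.xml,
something along these lines (property names taken from the stock Nutch
configuration; the values are just examples):

<!-- nutch-site.xml (excerpt): politeness settings, example values -->
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds to wait between successive requests
  to the same host.</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>Maximum number of threads allowed to access
  a single host at one time.</description>
</property>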
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com