John Mendenhall wrote:
>> What is the host distribution of your fetchlist? I.e., how many unique
>> hosts are there among all the URLs in the fetchlist? If it's just one
>> (or a few), it can happen that they are all mapped to a single map task.
>> This is done on purpose - there is no central lock manager in Nutch /
>> Hadoop, and Nutch needs a way to control the rate of access to any
>> single host, for politeness reasons. Nutch can do this only if all URLs
>> from the same host are assigned to the same map task.
>
> All hosts are the same. Every one of them.
>
> If there is no way to split them up, this seems to imply that the
> distributed nature of Nutch is lost when building an index for a single
> large site. Please correct me if this presumption is wrong.
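
Just to illustrate the per-host mapping mentioned above: conceptually it
works like a Hadoop partitioner keyed on the URL's host, roughly as in the
sketch below (class and method names here are illustrative, not the actual
Nutch source):

import java.net.MalformedURLException;
import java.net.URL;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sketch: send every URL from the same host to the same partition,
// so that a single task "owns" that host and can throttle requests to it.
public class HostPartitioner implements Partitioner<Text, Writable> {

  public void configure(JobConf job) {
    // no configuration needed for this sketch
  }

  public int getPartition(Text urlKey, Writable value, int numTasks) {
    String host;
    try {
      host = new URL(urlKey.toString()).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      host = urlKey.toString();   // fall back to the raw key
    }
    // Same host -> same hash -> same partition, no matter how many
    // tasks or machines take part in the fetch.
    return (host.hashCode() & Integer.MAX_VALUE) % numTasks;
  }
}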
It doesn't matter whether you use a distributed crawl or not - you are
still expected to crawl politely, meaning that you should not exceed a
certain rate of requests per second to any given host. Since all your
URLs come from the same host, no matter how many machines you throw at
it, you will still be crawling at a rate of 1 page / 5 seconds (or
whatever you set in nutch-site.xml). So a single machine can manage this
just fine.
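
For reference, that per-host delay is what you set in nutch-site.xml,
something along these lines (property names taken from the stock Nutch
configuration; the values are just examples):

<!-- nutch-site.xml (excerpt): politeness settings, example values -->
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds to wait between successive requests
  to the same host.</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>Maximum number of threads allowed to access
  a single host at one time.</description>
</property>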
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com