Hey,
I don't think it's due to politeness.
If all the URLs belong to a single domain, the default partitioner
(HashPartitioner: hash of the URL's domain) will put them all into ONE split.
So only one map task ends up heavily loaded with all the URLs.
Hence, if you implement your own Partitioner, the URLs will be spread across the cluster.
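For example, here is a minimal sketch against the old org.apache.hadoop.mapred
API: it hashes the whole URL instead of just the host, so URLs from one domain
land in different partitions. The class name and the Text/Writable key-value
types are my assumptions about which job you plug it into, so adjust them to
match the actual job's key/value classes.

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Partition by the hash of the WHOLE url, not just its domain,
  // so a single-domain crawl is spread over all partitions.
  public class PartitionUrlByWholeUrl implements Partitioner<Text, Writable> {

    public void configure(JobConf job) {
      // no configuration needed
    }

    public int getPartition(Text url, Writable value, int numPartitions) {
      // mask the sign bit so the partition index is never negative
      return (url.toString().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

You would then register it on the JobConf with
conf.setPartitionerClass(PartitionUrlByWholeUrl.class). Whether this is safe
for fetching (as opposed to, say, indexing) depends on how you want politeness
handled, since several nodes would then hit the same server.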
Venkat Shyam wrote:
I am trying to deploy a large intranet crawl (single domain, around 500,000
documents) and want to use the distributed crawl mechanism with at least 3 to 4
nodes for crawling/indexing. I have not been able to get Nutch/Hadoop to work in
a distributed fashion for a single domain. It looks like, due to politeness, a
single domain can be crawled only from a single machine. If anyone has
experience crawling a large intranet site, please share.
Shyam