Hi Andrzej,

I am experiencing similar problems distributing the fetch across multiple nodes. I am crawling a single host on an intranet, and I would like to know how I can modify Nutch's behavior so that it distributes the fetch over multiple nodes.
Soila

Andrzej Bialecki wrote:
>
> brainstorm wrote:
>> Sure, I tried with mapred.map.tasks and mapred.reduce.tasks with
>> values 2 and 1 respectively *in the past*, same results. Right now, I
>> have 32 for both: same results as those settings are just a hint for
>> nutch.
>>
>> Regarding number of threads *per host* I tried with 10 and 20 in the
>> past, same results.
>
> Indeed, the default number of maps and reduces can be changed for any
> particular job - the number of maps is adjusted according to the number
> of input splits (InputFormat.getSplits()), and the number of reduces can
> be adjusted programmatically in the application.
>
> Back to your issue: I suspect that your fetchlist is highly homogenous,
> i.e. contains urls from a single host. Nutch makes sure that all urls
> from a single host end up in a single map task, to ensure the politeness
> settings, so that's probably why you see only a single map task fetching
> all urls.
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>

--
View this message in context: http://www.nabble.com/Distributed-fetching-only-happening-in-one-node---tp18429531p18915705.html
Sent from the Nutch - User mailing list archive at Nabble.com.
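
To make the point about host-based partitioning concrete, here is a minimal sketch of why a single-host fetchlist ends up in one task no matter how many tasks are configured. It is not Nutch's actual partitioner: the class name HostPartitionSketch, the helper partitionFor, and the intranet.example URLs are all made up for illustration, and the only assumption is that the partition is chosen from the host name alone.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a stand-in for "all urls from a single host end up in a
// single map task". This is NOT code from Nutch or Hadoop.
final class HostPartitionSketch {

    // Pick a partition from the host name alone (hypothetical helper).
    static int partitionFor(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        if (host == null) {
            host = url; // no parsable host: fall back to the whole string
        }
        // floorMod keeps the result in [0, numPartitions) even when
        // hashCode() is negative.
        return Math.floorMod(host.toLowerCase(Locale.ROOT).hashCode(), numPartitions);
    }

    // Group URLs by the partition they would be assigned to.
    static Map<Integer, List<String>> partition(List<String> urls, int numPartitions) {
        Map<Integer, List<String>> parts = new TreeMap<>();
        for (String url : urls) {
            parts.computeIfAbsent(partitionFor(url, numPartitions), k -> new ArrayList<>())
                 .add(url);
        }
        return parts;
    }

    public static void main(String[] args) {
        // A homogeneous fetchlist: every URL is on the same (made-up) host.
        List<String> urls = Arrays.asList(
                "http://intranet.example/index.html",
                "http://intranet.example/docs/a.html",
                "http://intranet.example/docs/b.html");

        Map<Integer, List<String>> parts = partition(urls, 32);
        System.out.println(parts.size() + " of 32 partitions received URLs");
    }
}
```

With 32 partitions requested, every URL above still hashes to the same partition because the host is identical, which matches the behaviour described in the quoted reply: raising mapred.map.tasks or the per-host thread count does not spread a single host across nodes.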
