Andrzej, thanks for your advice. I was using a 20 MB URL list provided by our customers; I'll have to write a script to determine how homogeneous the input seed URL file is.
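A quick way to check that is to count how many URLs share each host. This is a minimal sketch of such a script; the "one URL per line" input format and the sample file name below are assumptions, not anything from the thread:

```python
# Sketch: measure how homogeneous a seed URL list is, by host.
# A list dominated by one host will show one host with most of the URLs.
from collections import Counter
from urllib.parse import urlparse

def host_distribution(lines):
    """Return a Counter mapping host -> number of URLs on that host."""
    hosts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        host = urlparse(line).netloc
        if host:
            hosts[host] += 1
    return hosts

if __name__ == "__main__":
    # Small inline sample; in practice pass the open seed file instead.
    sample = [
        "http://example.com/a",
        "http://example.com/b",
        "http://other.org/",
    ]
    hosts = host_distribution(sample)
    total = sum(hosts.values())
    print("distinct hosts:", len(hosts))
    for host, n in hosts.most_common(10):
        print("%s: %d (%.1f%%)" % (host, n, 100.0 * n / total))
```

Running it over the real seed file (e.g. `host_distribution(open("urls.txt"))`, file name hypothetical) shows at a glance whether a handful of hosts dominate the list.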
As a preliminary test, I ran a crawl using Nutch's integrated DMOZ parser (as suggested in the official Nutch tutorial), which I assume selects URLs in a more heterogeneous fashion. Is the resulting URL list a random enough sample? In fact, since DMOZ is a directory, the number of repeated URLs should be low, shouldn't it?

Bad news is that I'm getting the same results: only two nodes [1] are actually fetching :_( So I guess the problem is somewhere else (I already left the number of maps and reduces at 2 and 1, as suggested in this thread). Any further ideas/tests/fixes?

Thanks a lot for your patience and support,
Roman

[1] One of them is invariably the frontend; the other is a random node on each new crawl.

On Tue, Aug 5, 2008 at 12:07 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> brainstorm wrote:
>>
>> Sure, I tried with mapred.map.tasks and mapred.reduce.tasks with
>> values 2 and 1 respectively *in the past*, same results. Right now, I
>> have 32 for both: same results, as those settings are just a hint for
>> Nutch.
>>
>> Regarding the number of threads *per host*, I tried with 10 and 20 in
>> the past, same results.
>
> Indeed, the default number of maps and reduces can be changed for any
> particular job - the number of maps is adjusted according to the number of
> input splits (InputFormat.getSplits()), and the number of reduces can be
> adjusted programmatically in the application.
>
> Back to your issue: I suspect that your fetchlist is highly homogeneous,
> i.e. it contains URLs from a single host. Nutch makes sure that all URLs
> from a single host end up in a single map task, to enforce the politeness
> settings, so that's probably why you see only a single map task fetching
> all URLs.
>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web, Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
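Andrzej's point about host-based partitioning can be illustrated with a small sketch. This mirrors the idea only, not Nutch's actual partitioner code: every URL is assigned to a task by hashing its host, so a single-host fetchlist collapses onto one fetcher no matter how many map tasks are configured.

```python
# Sketch of host-based partitioning, the idea behind Nutch's
# politeness-preserving fetchlist split (not Nutch's actual code).
from urllib.parse import urlparse

def partition(url, num_tasks):
    """Assign a URL to a task by hashing its host; all URLs from
    the same host therefore land on the same task."""
    host = urlparse(url).netloc
    return hash(host) % num_tasks

if __name__ == "__main__":
    # 100 URLs, all on one host, spread over 32 configured tasks:
    urls = ["http://example.com/page%d" % i for i in range(100)]
    tasks = {partition(u, 32) for u in urls}
    # Every URL maps to the same single task, which is why only one
    # node ends up fetching a homogeneous list.
    print(len(tasks))  # prints 1
```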
