Andrzej, thanks for your advice... I was using a 20MB URL list
provided by our customers; I'll have to write a script to determine
the homogeneity of the input seed URL file.
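For what it's worth, a rough sketch of such a script (hypothetical, not part of Nutch; the function name and the idea of counting URLs per host are my own) could look like this:

```python
from collections import Counter
from urllib.parse import urlparse

def host_distribution(urls):
    """Count how many seed URLs fall on each host.

    A seed file dominated by one or two hosts will show up here
    as a Counter with a few very large entries.
    """
    counts = Counter()
    for line in urls:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        host = urlparse(line).netloc.lower()
        if host:
            counts[host] += 1
    return counts

if __name__ == "__main__" and len(__import__("sys").argv) > 1:
    import sys
    counts = host_distribution(open(sys.argv[1]))
    print(f"{sum(counts.values())} urls across {len(counts)} hosts")
    for host, n in counts.most_common(10):
        print(f"{n:8d}  {host}")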

As a preliminary test, I ran a crawl using the integrated Nutch DMOZ
parser (as suggested in the official Nutch tutorial), which I assume
chooses URLs in a more heterogeneous fashion. Is the resulting URL
list a random enough sample? ... In fact, being a directory, the
number of repeated URLs should be low, shouldn't it?

The bad news is that I'm getting the same results: just two nodes[1]
are actually fetching :( So I guess the problem is somewhere else (I
already set the number of maps and reduces to 2 and 1 as suggested in
this thread).
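For reference, those map/reduce counts would be set in the Hadoop/Nutch configuration (a sketch of the relevant fragment, using the pre-0.20 property names Nutch used at the time; the exact file may be hadoop-site.xml or nutch-site.xml depending on setup):

```xml
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
</property>
```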

Any further ideas/tests/fixes?

Thanks a lot for your patience and support,
Roman

[1] one of them being the frontend (invariably), and the other a
random node on each new crawl.

On Tue, Aug 5, 2008 at 12:07 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> brainstorm wrote:
>>
>> Sure, I tried with mapred.map.tasks and mapred.reduce.tasks with
>> values 2 and 1 respectively *in the past*, same results. Right now, I
>> have 32 for both: same results as those settings are just a hint for
>> nutch.
>>
>> Regarding number of threads *per host* I tried with 10 and 20 in the
>> past, same results.
>
> Indeed, the default number of maps and reduces can be changed for any
> particular job - the number of maps is adjusted according to the number of
> input splits (InputFormat.getSplits()), and the number of reduces can be
> adjusted programmatically in the application.
>
> Back to your issue: I suspect that your fetchlist is highly homogeneous, i.e.
> contains urls from a single host. Nutch makes sure that all urls from a
> single host end up in a single map task, to ensure the politeness settings,
> so that's probably why you see only a single map task fetching all urls.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
