On Tue, Aug 5, 2008 at 12:07 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> brainstorm wrote:
>>
>> Sure, I tried with mapred.map.tasks and mapred.reduce.tasks with
>> values 2 and 1 respectively *in the past*: same results. Right now, I
>> have 32 for both: same results, as those settings are just a hint for
>> Nutch.
>>
>> Regarding the number of threads *per host*, I tried with 10 and 20 in
>> the past, same results.
>
> Indeed, the default number of maps and reduces can be changed for any
> particular job - the number of maps is adjusted according to the number of
> input splits (InputFormat.getSplits()), and the number of reduces can be
> adjusted programmatically in the application.
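
Just to make sure I understand the "programmatically" part: you mean
something along these lines, right? (A minimal sketch of my own against the
old mapred API, not Nutch code; the class and job names are made up, and the
input/output setup is left out.)

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver, just to show where the knobs live.
public class MyCrawlStep {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(MyCrawlStep.class);
    job.setJobName("my-crawl-step");

    // Only a hint: the real number of maps is driven by
    // InputFormat.getSplits(), as you point out above.
    job.setNumMapTasks(32);

    // This one is honoured: the framework runs exactly this many reduces.
    job.setNumReduceTasks(32);

    // (input/output paths, formats, mapper/reducer classes omitted)
    JobClient.runJob(job);
  }
}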



For now, my focus is on using the Nutch command-line tool:

$ bin/nutch crawl $URL_DIR_DFS -dir $CRAWL_DIR -depth 5

I assume (perhaps incorrectly) that Nutch will determine the number of
maps & reduces dynamically. Is that true, or should I switch to a
custom-coded crawler using the Nutch API?

Btw, having a look at "getSplits", I suspect that Fetcher does precisely
what I don't want it to do: it does not split its inputs... so the fewer
input splits there are, the fewer map tasks get spread across the nodes
during the fetch phase. Am I wrong? Here is the code in question (and,
right after it, a quick sanity check I sketched):

public class Fetcher
(...)

    /** Don't split inputs, to keep things polite. */
    public InputSplit[] getSplits(JobConf job, int nSplits)
      throws IOException {
      Path[] files = listPaths(job);
      FileSystem fs = FileSystem.get(job);
      InputSplit[] splits = new InputSplit[files.length];
      for (int i = 0; i < files.length; i++) {
        splits[i] = new FileSplit(files[i], 0,
            fs.getFileStatus(files[i]).getLen(), (String[])null);
      }
      return splits;
    }
  }
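
If I'm reading that right, then the number of fetch maps should simply equal
the number of input files the fetch job sees (one file -> one split -> one
map task), no matter how large those files are. As a quick sanity check I put
together the little helper below; it is my own sketch, not part of Nutch, and
it assumes the fetchlist lives under <segment>/crawl_generate as a set of
part-NNNNN files:

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical helper: count the part files a fetch job would turn into splits.
public class PredictFetchMaps {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(PredictFetchMaps.class);
    FileSystem fs = FileSystem.get(conf);

    // args[0] = a segment directory, e.g. crawl/segments/20080805123456
    Path fetchlist = new Path(args[0], "crawl_generate");

    int parts = 0;
    for (FileStatus status : fs.listStatus(fetchlist)) {
      if (status.getPath().getName().startsWith("part-")) {
        parts++;  // one input file -> one split -> one fetch map task
      }
    }
    System.out.println("Expected fetch map tasks: " + parts);
  }
}

If that prints 1 for my segments, that would explain why only one node ends
up doing all the fetching.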


Thanks in advance,
Roman



> Back to your issue: I suspect that your fetchlist is highly homogeneous,
> i.e. contains URLs from a single host. Nutch makes sure that all URLs from
> a single host end up in a single map task, to respect the politeness
> settings, so that's probably why you see only a single map task fetching
> all urls.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
