I was wondering... if I split the input urls like this:
url1.txt url2.txt ... urlN.txt
will this input spread the map jobs across N nodes? Right now I'm using
just one (big) urls.txt file, and only 2 nodes are actually fetching.
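For what it's worth, a quick way to try this is to split the seed list on the shell before injecting it. This is only a minimal sketch: the file name urls.txt, the sample seed URLs, and the chunk count of 4 are placeholders, and the Fetcher may still group all urls from one host into a single map task no matter how many input files there are:

```shell
# Minimal sketch: split one big seed list into 4 roughly equal chunks,
# so the job sees several input files instead of a single urls.txt.
# The sample seeds below are made up for illustration.
printf 'http://a.example/\nhttp://b.example/\nhttp://c.example/\nhttp://d.example/\n' > urls.txt
lines=$(wc -l < urls.txt)
# Ceiling division: lines per chunk so that at most 4 chunks are produced.
split -l "$(( (lines + 3) / 4 ))" urls.txt url_
ls url_*
```

Each resulting url_ file then becomes its own input split, since the Fetcher creates one split per file.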
Thanks in advance,
Roman
On Wed, Aug 6, 2008 at 11:51 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> On Tue, Aug 5, 2008 at 12:07 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>> brainstorm wrote:
>>>
>>> Sure, I tried with mapred.map.tasks and mapred.reduce.tasks set to
>>> 2 and 1 respectively *in the past*, with the same results. Right now
>>> I have 32 for both: same results, since those settings are just a
>>> hint for Nutch.
>>>
>>> Regarding the number of threads *per host*, I tried 10 and 20 in
>>> the past, with the same results.
>>
>> Indeed, the default number of maps and reduces can be changed for any
>> particular job - the number of maps is adjusted according to the number of
>> input splits (InputFormat.getSplits()), and the number of reduces can be
>> adjusted programmatically in the application.
>
>
>
> For now, my focus is on using nutch commandline tool:
>
> $ bin/nutch crawl $URL_DIR_DFS -dir $CRAWL_DIR -depth 5
>
> I assume (perhaps incorrectly) that Nutch will determine the number
> of maps and reduces dynamically. Is that true, or should I switch to a
> custom crawler built on the Nutch API?
>
> Btw, having a look at "getSplits", I suspect that Fetcher does
> precisely what I don't want it to do: it does not split its inputs.
> So the fewer the input splits, the fewer maps will be spread across
> nodes in the fetch phase. Am I wrong?
>
> public class Fetcher
> (...)
>
>   /** Don't split inputs, to keep things polite. */
>   public InputSplit[] getSplits(JobConf job, int nSplits)
>       throws IOException {
>     Path[] files = listPaths(job);
>     FileSystem fs = FileSystem.get(job);
>     // One split per input file, so each fetchlist part becomes one map task.
>     InputSplit[] splits = new InputSplit[files.length];
>     for (int i = 0; i < files.length; i++) {
>       splits[i] = new FileSplit(files[i], 0,
>           fs.getFileStatus(files[i]).getLen(), (String[]) null);
>     }
>     return splits;
>   }
> }
>
>
> Thanks in advance,
> Roman
>
>
>
>> Back to your issue: I suspect that your fetchlist is highly homogeneous,
>> i.e. it contains urls from a single host. Nutch makes sure that all urls
>> from a single host end up in a single map task, to enforce the politeness
>> settings, so that's probably why you see only a single map task fetching
>> all urls.
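
A quick way to check that homogeneity hypothesis is to count the distinct hosts in the seed list. A minimal sketch, assuming a urls.txt file of http URLs (the sample seeds below are made up):

```shell
# Minimal sketch: count urls per host in a seed list. If one host
# dominates, most urls will end up in a single fetch map task.
# The sample seeds are made up for illustration.
printf 'http://a.example/x\nhttp://a.example/y\nhttp://b.example/\n' > urls.txt
# With "/" as the field separator, field 3 of "http://host/path" is the host.
awk -F/ '{print $3}' urls.txt | sort | uniq -c | sort -rn
# prints one line per host, highest count first, e.g. "2 a.example"
```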
>>
>>
>> --
>> Best regards,
>> Andrzej Bialecki <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  || |   Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>