It should make no difference: all URLs from all files in the directory
are injected into the crawldb first, so the fetch lists are built from the
merged crawldb rather than from your individual seed files.
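
What does control the number of fetch map tasks is the number of fetch
lists the generator produces. If I remember the options correctly,
something like this should give you four fetch lists (the paths follow the
$CRAWL_DIR layout from your command below):

$ bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments -numFetchers 4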

Alex

2008/8/8 brainstorm <[EMAIL PROTECTED]>

> I was wondering... if I split the input urls like this:
>
> url1.txt url2.txt ... urlN.txt
>
> Will this input spread map jobs across N nodes? Right now I'm using just
> one (big) urls.txt file (and only 2 nodes are actually fetching).
>
> Thanks in advance,
> Roman
>
> On Wed, Aug 6, 2008 at 11:51 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> > On Tue, Aug 5, 2008 at 12:07 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> >> brainstorm wrote:
> >>>
> >>> Sure, I tried mapred.map.tasks and mapred.reduce.tasks with values 2
> >>> and 1 respectively *in the past*: same results. Right now I have 32
> >>> for both, and again the same results, since those settings are just a
> >>> hint for nutch.
> >>>
> >>> Regarding the number of threads *per host*, I tried 10 and 20 in the
> >>> past; same results.
> >>
> >> Indeed, the default number of maps and reduces can be changed for any
> >> particular job - the number of maps is adjusted according to the number
> >> of input splits (InputFormat.getSplits()), and the number of reduces
> >> can be adjusted programmatically in the application.
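> >>
> >> For instance (a minimal sketch against the old org.apache.hadoop.mapred
> >> API that Nutch builds on; this fragment is illustrative, not actual
> >> Nutch code):
> >>
> >>   JobConf job = new JobConf(NutchConfiguration.create());
> >>   // Only a hint: the actual number of maps follows the splits
> >>   // returned by the job's InputFormat.getSplits().
> >>   job.setNumMapTasks(32);
> >>   // Authoritative: the framework honours this reduce count.
> >>   job.setNumReduceTasks(8);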
> >
> >
> >
> > For now, my focus is on using the nutch command-line tool:
> >
> > $ bin/nutch crawl $URL_DIR_DFS -dir $CRAWL_DIR -depth 5
> >
> > I assume (perhaps incorrectly) that nutch will determine the number
> > of maps & reduces dynamically. Is that true, or should I switch to a
> > custom-coded crawler using the nutch API?
> >
> > Btw, having a look at "getSplits", I suspect that Fetcher does
> > precisely what I don't want it to do: it does not split its inputs...
> > so the fewer input splits there are, the fewer maps will be spread
> > across the nodes during the fetch phase. Am I wrong?:
> >
> > public class Fetcher
> > (...)
> >
> >    /** Don't split inputs, to keep things polite. */
> >    public InputSplit[] getSplits(JobConf job, int nSplits)
> >      throws IOException {
> >      Path[] files = listPaths(job);
> >      FileSystem fs = FileSystem.get(job);
> >      InputSplit[] splits = new InputSplit[files.length];
> >      for (int i = 0; i < files.length; i++) {
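> >        // one split per input file, regardless of its size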
> >        splits[i] = new FileSplit(files[i], 0,
> >            fs.getFileStatus(files[i]).getLen(), (String[])null);
> >      }
> >      return splits;
> >    }
> >  }
> >
> >
> > Thanks in advance,
> > Roman
> >
> >
> >
> >> Back to your issue: I suspect that your fetchlist is highly
> >> homogeneous, i.e. it contains urls from a single host. Nutch makes sure
> >> that all urls from a single host end up in a single map task, to
> >> enforce the politeness settings, so that's probably why you see only a
> >> single map task fetching all urls.
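> >>
> >> For illustration, here is a simplified sketch of that host-based
> >> partitioning (the real logic lives in Nutch's PartitionUrlByHost; this
> >> method is just the idea, not the actual code):
> >>
> >>   // Urls sharing a host land in the same partition, hence the same
> >>   // fetch map task; one dominant host means one busy map task.
> >>   int partitionForUrl(String url, int numPartitions)
> >>       throws java.net.MalformedURLException {
> >>     String host = new java.net.URL(url).getHost();
> >>     return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
> >>   }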
> >>
> >>
> >> --
> >> Best regards,
> >> Andrzej Bialecki     <><
> >>  ___. ___ ___ ___ _ _   __________________________________
> >> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> >> http://www.sigram.com  Contact: info at sigram dot com
> >>
> >>
> >
>



-- 
Best Regards
Alexander Aristov
