Thanks Dennis.

But, hm, I don't get it 100% yet.  I looked at Generator.java and I see this:

    if (numLists == -1) {                         // for politeness make
      numLists = job.getNumMapTasks();            // a partition per fetch task
    }

Thus, when -numFetchers is not given, the number of fetchlists in a segment
will equal the number of map tasks (which is what I see happening).  So, in
your example with 10 machines in a cluster, -numFetchers 10 would not
really be needed, as Generator would already know to generate 10 fetchlists.
So when *does* one want to specify a -numFetchers value that's different from
(higher than?) the total number of map tasks in a cluster, as specified by the
mapred.map.tasks setting in hadoop-site.xml / hadoop-default.xml?
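
To make sure I'm reading that fallback right, here is a minimal standalone sketch of it.  The class and method names (FetchListDefaults, resolveNumLists) are made up for illustration and are not part of Nutch; only the -1 sentinel and the fall-back to the map task count come from the Generator.java snippet above.

```java
// Hypothetical helper, NOT part of Nutch: mirrors the -1 fallback quoted above.
final class FetchListDefaults {
    // Sentinel meaning "-numFetchers was not given" (taken from the quoted snippet).
    static final int UNSET = -1;

    // Returns the number of fetchlists to generate: the requested value if one
    // was given, otherwise the job's map task count.
    static int resolveNumLists(int requestedNumLists, int numMapTasks) {
        return requestedNumLists == UNSET ? numMapTasks : requestedNumLists;
    }

    public static void main(String[] args) {
        System.out.println(resolveNumLists(UNSET, 10)); // no -numFetchers: prints 10
        System.out.println(resolveNumLists(25, 10));    // -numFetchers 25: prints 25
    }
}
```

If that is accurate, an explicit -numFetchers only changes anything when its value differs from the map task count.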

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sunday, April 13, 2008 11:11:25 AM
Subject: Re: Parallel operations in fetch

Running generate with -numFetchers n will create n reduce tasks for the
generator output.  That output is used as input for the fetchers, so when
the fetcher runs it will break the job into n tasks.  That
doesn't mean that they will *all* run in parallel.  That is dependent on 
the max tasks per server and the number of total servers running the job 
in the hadoop cluster.

But say you have 10 machines in your cluster and you do a generate 
-numFetchers 10 then they should all run in parallel.
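
As a rough back-of-the-envelope sketch of that dependency (not Nutch or Hadoop code): the names below are hypothetical, maxTasksPerServer stands in for whatever per-node task limit your cluster is configured with, and the "waves" figure assumes tasks of similar length on a cluster with at least one slot, which real jobs will only approximate.

```java
// Hypothetical helper, NOT part of Nutch or Hadoop.
final class FetchParallelism {
    // Tasks that can run at the same time: capped by the number of tasks
    // and by the total task slots in the cluster.
    static int concurrentTasks(int numTasks, int servers, int maxTasksPerServer) {
        return Math.min(numTasks, servers * maxTasksPerServer);
    }

    // Approximate number of rounds needed to run every task (ceiling division).
    // Assumes servers and maxTasksPerServer are both positive.
    static int waves(int numTasks, int servers, int maxTasksPerServer) {
        int slots = servers * maxTasksPerServer;
        return (numTasks + slots - 1) / slots;
    }

    public static void main(String[] args) {
        System.out.println(concurrentTasks(10, 10, 1)); // 10 tasks, 10 slots: prints 10
        System.out.println(concurrentTasks(40, 10, 2)); // 40 tasks, 20 slots: prints 20
        System.out.println(waves(40, 10, 2));           // prints 2
    }
}
```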

Dennis

[EMAIL PROTECTED] wrote:
> Hi,
> 
> I was able to dig out a related message/threads from "only" 3 years ago:
> 
> http://markmail.org/message/dp6a6isdboz46wez#query:+page:1+mid:o7p2iqqp66zumwcs+state:results
> 
> Is the story with running generate with -numFetchers N and running N parallel 
> fetch jobs still true?
> 
> Thanks,
> Otis 
> 
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> ----- Original Message ----
> From: Tomislav Poljak <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Thursday, April 10, 2008 2:57:04 PM
> Subject: Parallel operations in fetch
> 
> Hi,
> is there a way to do some of these operations in parallel safely:
> generate, fetch, parse and updatedb (and if so, how)?
> 
> thanks,
>          Tomislav
> 
> 
> 
> 


