True, -numFetchers wouldn't be needed there. I was just trying to
illustrate.
Although I have never used it myself (never needed to, because of its
default behavior), I guess it could be used if you want only one machine
to fetch all of the URLs: you could run generate with -numFetchers 1.
Dennis
Otis Gospodnetic wrote:
Thanks Dennis.
But, hm, I don't get it 100% yet. I looked at Generator.java and I see this:
    if (numLists == -1) {               // for politeness make
      numLists = job.getNumMapTasks();  // a partition per fetch task
    }
Thus, when -numFetchers is not given, the number of fetchlists in a segment
will equal the number of map tasks (which is what I see happening). So, in
your example with 10 machines in a cluster, that -numFetchers 10 would not
really be needed, as Generator would already know to generate 10 fetchlists.
So when *does* one want to specify a -numFetchers value that's different (higher?)
than the total number of map tasks in the cluster, as set by the mapred.map.tasks
property in hadoop-site.xml or hadoop-default.xml?
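
For illustration only, the defaulting rule in the snippet quoted above can be
sketched as a small standalone Java class. FetchListCount, UNSET and
resolveNumLists are hypothetical names and are not part of Nutch; only the -1
sentinel and the fallback to the number of map tasks come from the quoted
Generator.java lines.

```java
// Hypothetical helper, NOT Nutch code: it only mirrors the defaulting rule
// seen in the quoted Generator.java snippet.
final class FetchListCount {

    /** Sentinel meaning "-numFetchers was not given", as in the quoted snippet. */
    static final int UNSET = -1;

    /**
     * Returns the number of fetchlists to generate.
     * Falls back to the number of map tasks when -numFetchers is unset.
     */
    static int resolveNumLists(int numFetchers, int numMapTasks) {
        if (numFetchers == UNSET) {
            return numMapTasks;   // one partition per fetch task
        }
        return numFetchers;       // explicit value wins
    }

    public static void main(String[] args) {
        // No -numFetchers given, 10 map tasks configured.
        System.out.println(resolveNumLists(UNSET, 10));
        // Explicit -numFetchers 1 on the same cluster.
        System.out.println(resolveNumLists(1, 10));
    }
}
```

Under this sketch an explicit value only matters when it differs from the
configured number of map tasks, which is the case being asked about.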
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sunday, April 13, 2008 11:11:25 AM
Subject: Re: Parallel operations in fetch
Running generate with -numFetchers n will create n reduce tasks for the
generator output. That output is used as input for the fetchers, so when
the fetcher runs it will break the job into n tasks. That doesn't mean
that they will *all* run in parallel. That depends on the max tasks per
server and the total number of servers running the job in the hadoop
cluster.
But say you have 10 machines in your cluster and you do a generate
-numFetchers 10; then they should all run in parallel.
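
As a rough back-of-the-envelope sketch of the point above (not Nutch or Hadoop
code), the Java class below estimates how many of the n fetch tasks can run at
once. FetchParallelism and its methods are hypothetical names, and the model
is a simplifying assumption: every server offers the same number of task
slots, all slots are free, and nothing else is running on the cluster.

```java
// Hypothetical estimate, NOT Nutch or Hadoop code. Assumes identical servers,
// all task slots free, and no other jobs competing for them.
final class FetchParallelism {

    /** Total task slots in the cluster under the assumptions above. */
    static int slots(int servers, int maxTasksPerServer) {
        if (servers <= 0 || maxTasksPerServer <= 0) {
            throw new IllegalArgumentException("servers and maxTasksPerServer must be positive");
        }
        return servers * maxTasksPerServer;
    }

    /** How many of the numFetchers tasks can run at the same time. */
    static int concurrentTasks(int numFetchers, int servers, int maxTasksPerServer) {
        return Math.min(numFetchers, slots(servers, maxTasksPerServer));
    }

    /** How many rounds are needed to run all tasks (ceiling division). */
    static int waves(int numFetchers, int servers, int maxTasksPerServer) {
        int s = slots(servers, maxTasksPerServer);
        return (numFetchers + s - 1) / s;
    }

    public static void main(String[] args) {
        // 10 fetchlists on 10 machines with one task each: all run together.
        System.out.println(concurrentTasks(10, 10, 1) + " at once, " + waves(10, 10, 1) + " wave(s)");
        // 10 fetchlists on 2 machines with two tasks each: several rounds.
        System.out.println(concurrentTasks(10, 2, 2) + " at once, " + waves(10, 2, 2) + " wave(s)");
    }
}
```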
Dennis
[EMAIL PROTECTED] wrote:
Hi,
I was able to dig out a related message/threads from "only" 3 years ago:
http://markmail.org/message/dp6a6isdboz46wez#query:+page:1+mid:o7p2iqqp66zumwcs+state:results
Is the story with running generate with -numFetchers N and running N parallel
fetch jobs still true?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Tomislav Poljak <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, April 10, 2008 2:57:04 PM
Subject: Parallel operations in fetch
Hi,
Is there a way to safely run some of these operations in parallel:
generate, fetch, parse and updatedb (and if so, how)?
thanks,
Tomislav