Right.  So it sounds like -numFetchers really makes the most sense when one 
wants to limit things, not increase them.

I'm thinking back to the original question of parallelizing / overlapping 
different operations or steps in order to speed up the overall process.  It 
doesn't sound like there is anything in the generate / fetch / parse / updatedb 
process that one can overlap.  The only thing I can think of is that it 
probably pays off to have larger fetchlists, as it seems that it takes Generator 
just as much time to generate a larger fetchlist as it does to generate a small 
one.  Thus, with a larger fetchlist one at least avoids waiting for multiple 
Generator runs.
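
A rough back-of-the-envelope model of that last point, as a sketch only. It assumes (per the observation above) that a Generator run costs about the same no matter how large the fetchlist is. The function name and all numbers are made up for illustration; they are not measurements from a real crawl.

```python
import math

def total_generate_minutes(total_urls, fetchlist_size, minutes_per_run):
    """Total time spent in Generator over a whole crawl.

    Assumption (not verified): every Generator run takes the same
    minutes_per_run regardless of fetchlist_size.
    """
    runs = math.ceil(total_urls / fetchlist_size)
    return runs * minutes_per_run

# Hypothetical crawl: 1,000,000 URLs, 30 minutes per Generator run.
small_lists = total_generate_minutes(1_000_000, 100_000, 30)  # 10 runs
large_lists = total_generate_minutes(1_000_000, 500_000, 30)  # 2 runs
print(small_lists, large_lists)
```

Under that assumption the only saving is the number of Generator runs; the fetch / parse / updatedb time is not modeled here at all.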

Do others have different experiences?

Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, April 16, 2008 8:03:29 AM
Subject: Re: Parallel operations in fetch

Dennis Kubes wrote:
> True, the numFetchers wouldn't be needed there; I was just trying to 
> illustrate.
> 
> Although I have never used it myself (never needed to because of its 
> default behavior), I guess it could be used if you want only one machine 
> to fetch all of the URLs: you could do a -numFetchers 1.

There are other reasons, too. If you have a cluster with limited 
capacity (e.g. 10 map slots) and you still want to run other jobs while 
the fetcher is running, you may specify -numFetchers 2, which keeps 8 
map slots available for other jobs.
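
The slot arithmetic in that example can be sketched as follows. This assumes each fetcher map task occupies exactly one map slot for the duration of the fetch job; the helper name is hypothetical and is not part of Nutch or Hadoop.

```python
def free_map_slots(cluster_map_slots, num_fetchers):
    """Map slots left over for other jobs while the fetch job runs.

    Assumption: one fetcher map task holds one map slot, and tasks beyond
    the cluster capacity simply wait, so the result never goes below zero.
    """
    return max(cluster_map_slots - num_fetchers, 0)

# The case from the text: 10 map slots, -numFetchers 2.
print(free_map_slots(10, 2))
```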

Another situation: presumably your config specifies the default number 
of map tasks equal to the cluster capacity, so when you start a fetch 
job it allocates all map slots. However, if you run some heavy plugins 
inside the Fetcher context (urlfilters, parsers, etc), you may want to 
limit the maximum amount of data in a map task by creating more map 
tasks than necessary.
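
A minimal sketch of why more map tasks means less data per task. It assumes the fetchlist is split evenly across tasks, which is a simplification; a real fetchlist may be partitioned unevenly. The function name and numbers are illustrative only.

```python
import math

def urls_per_map_task(fetchlist_size, num_map_tasks):
    """Upper bound on URLs handled by one map task, assuming an even split."""
    return math.ceil(fetchlist_size / num_map_tasks)

# Hypothetical fetchlist of 1,000,000 URLs on a cluster with 10 map slots.
at_capacity = urls_per_map_task(1_000_000, 10)    # one task per slot
more_tasks = urls_per_map_task(1_000_000, 40)     # four tasks per slot, run in waves
print(at_capacity, more_tasks)
```

With more tasks than slots, the tasks run in several waves, but each one carries a smaller share of the data through the plugins.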

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



