Right. So it sounds like it really makes most sense when one wants to limit things, not increase them.
I'm thinking back to the original question of parallelizing / overlapping different operations or steps in order to speed up the overall process. It doesn't sound like there is anything in the generate / fetch / parse / updatedb process that one can overlap. The only thing I can think of is that it probably pays off to have larger fetchlists, as it seems that it takes Generator just as much time to generate a larger fetchlist as it does to generate a small one. Thus, with a larger fetchlist one at least avoids waiting for multiple Generator runs.

Do others have different experiences?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, April 16, 2008 8:03:29 AM
Subject: Re: Parallel operations in fetch

Dennis Kubes wrote:
> True the numFetchers wouldn't be needed there, was just trying to
> illustrate.
>
> Although I have never used it myself (never needed to because of its
> default behavior), I guess it could be used if you want only one machine
> to fetch all of the urls you could do a -numFetchers 1.

There are other reasons, too. If you have a cluster with limited capacity (e.g. 10 map slots) and you still want to run other jobs while the fetcher is running, you may specify -numFetchers 2, then you keep 8 map slots available for other jobs.

Another situation: presumably your config specifies the default number of map tasks equal to the cluster capacity, so when you start a fetch job it allocates all map slots. However, if you run some heavy plugins inside the Fetcher context (urlfilters, parsers, etc), you may want to limit the maximum amount of data in a map task by creating more map tasks than necessary.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
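
P.S. To put rough numbers on the fetchlist-size point, here is a back-of-the-envelope sketch in Python. It is not Nutch code. It assumes the Generator costs about the same fixed time per run regardless of fetchlist size (which is only my impression from the runs described above) and that fetch time grows linearly with the number of URLs. The function name and all the figures (20 minutes per Generator run, 2 minutes per 1000 URLs) are made up for illustration.

```python
import math


def total_crawl_minutes(total_urls, fetchlist_size,
                        generate_minutes, fetch_minutes_per_1000_urls):
    """Rough model: each cycle pays one fixed-cost Generator run,
    while fetch time depends only on the total number of URLs."""
    # Number of generate/fetch cycles needed to cover all URLs.
    cycles = math.ceil(total_urls / fetchlist_size)
    # Assumption: Generator time does not depend on fetchlist size.
    generate_cost = cycles * generate_minutes
    # Assumption: fetching scales linearly with URL count.
    fetch_cost = total_urls / 1000 * fetch_minutes_per_1000_urls
    return generate_cost + fetch_cost


# Hypothetical figures: 1M URLs, 20 min per Generator run, 2 min per 1000 URLs.
small = total_crawl_minutes(1_000_000, 50_000, 20, 2)
large = total_crawl_minutes(1_000_000, 500_000, 20, 2)
print(f"50k-URL fetchlists:  {small:.0f} min")
print(f"500k-URL fetchlists: {large:.0f} min")
```

Under these assumptions the only difference between the two cases is the number of Generator runs (20 versus 2), so the saving is exactly the Generator time that is no longer repeated. If Generator time does in fact grow with fetchlist size on your setup, the gap shrinks accordingly.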
