Dennis Kubes wrote:
True, the numFetchers option wouldn't be needed there; I was just
trying to illustrate.
Although I have never used it myself (never needed to because of its
default behavior), I guess it could be used if you want only one machine
to fetch all of the URLs: you could do a -numFetchers 1.
There are other reasons, too. If you have a cluster with limited
capacity (e.g. 10 map slots) and you still want to run other jobs while
the fetcher is running, you may specify -numFetchers 2, which leaves 8
map slots available for other jobs.
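
To make that arithmetic concrete, here is a minimal Python sketch. The
function name is made up for illustration, and it assumes that each
fetch list becomes exactly one map task occupying one map slot; it is
not Nutch code.

```python
def free_map_slots(total_slots: int, num_fetchers: int) -> int:
    """Map slots left for other jobs while the fetch job is running.

    Assumption: one map task per fetch list, one slot per running task.
    """
    # The fetch job can never occupy more slots than the cluster has.
    running = min(num_fetchers, total_slots)
    return total_slots - running


# The example from the text: 10 map slots, -numFetchers 2.
print(free_map_slots(10, 2))
```
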
Another situation: presumably your config sets the default number of
map tasks equal to the cluster capacity, so when you start a fetch job
it occupies all map slots. However, if you run some heavy plugins
inside the Fetcher context (urlfilters, parsers, etc.), you may want to
limit the maximum amount of data handled by a single map task by
creating more map tasks than there are map slots.
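
A rough Python sketch of that trade-off follows. The figures (one
million URLs, 10 map slots) and both function names are hypothetical,
and the even split across tasks is a simplifying assumption; real fetch
lists may be unevenly sized.

```python
import math


def urls_per_task(total_urls: int, num_tasks: int) -> int:
    """Upper bound on URLs per map task, assuming an even split."""
    return math.ceil(total_urls / num_tasks)


def task_waves(num_tasks: int, map_slots: int) -> int:
    """How many rounds of map tasks are needed to run them all."""
    return math.ceil(num_tasks / map_slots)


total_urls = 1_000_000  # hypothetical size of the generated segment
map_slots = 10          # hypothetical cluster capacity

for num_tasks in (10, 40):
    print(num_tasks,
          urls_per_task(total_urls, num_tasks),
          task_waves(num_tasks, map_slots))
```

With more tasks than slots, each task holds less data at once, at the
cost of running the tasks in several waves.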
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com