On 2010-08-02 20:57, Scott Gonyea wrote:
Thank you very much, Andrzej.  I'm really hoping some people can share
some non-sensitive details of their setup.  I'm really curious about the
following:

The ratio of Maps to Reduces for their nutch jobs?

This depends on the job and the amount of data. The more data, the more map tasks you will have. The number of reduce tasks is fixed, and it should be set so that the sort and reduce operations in each reduce task work on a reasonably sized chunk of output data. Hence the recommendation to set it to somewhere between 1x and 2x the number of nodes.

(BTW, the thing about primes is to avoid task scheduling issues, esp. in the presence of speculative execution... but that's another subject).
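
As a rough illustration of the 1x-2x rule of thumb above, here is a minimal Python sketch. The function names are hypothetical and not part of Nutch or Hadoop. It only does the arithmetic, and it lists the prime counts inside the range in case you want to follow the prime-number suggestion.

```python
def reduce_task_range(num_nodes):
    """Return (low, high) reduce-task counts for the 1x-2x-nodes rule of thumb."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return num_nodes, 2 * num_nodes


def is_prime(n):
    """Trial-division primality check; fine for cluster-sized numbers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def prime_reduce_counts(num_nodes):
    """List the prime reduce-task counts between 1x and 2x the node count."""
    low, high = reduce_task_range(num_nodes)
    return [n for n in range(low, high + 1) if is_prime(n)]


if __name__ == "__main__":
    print(reduce_task_range(10))    # (10, 20)
    print(prime_reduce_counts(10))  # [11, 13, 17, 19]
```

For a 10-node cluster this gives a range of 10 to 20 reduce tasks, with 11, 13, 17 and 19 as the prime candidates. Which one works best still depends on the job and the data volume.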


The amount of memory that they allocate to each job task?

Sufficient for the task ;) Both map and reduce operate on a fixed memory budget, using on-disk iterators, so you need not worry about the total number of records you want to process. Just allocate enough memory to correctly process a tuple, with some room to spare for Hadoop task buffers, and optionally a bit more if you use a Combiner. All in all, I rarely see a good reason to go above 768MB, and often use less than that.

The number of simultaneous Maps/Reduces on any given host?

Depends on the host: the amount of RAM and CPU. I usually use values ranging from 2 maps (low-end hardware) to 4 (regular hardware) to 8 (higher-end hardware), whatever low/regular/high means. Reduce tasks also include the sorting, which is IO intensive, so I usually allocate 1-2 per node.

The number of fetcher threads they execute?

Something between 10 and 100. With higher values you need a dedicated DNS cache - 100 threads all looking up IPs take their toll...
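
To show why caching matters once many threads resolve the same hosts, here is a small in-process sketch in Python. Note that this is not the dedicated DNS cache meant above (that would normally be a caching name server running next to the fetcher); it is only an illustration of the idea. The class name `CachingResolver` and the injectable `lookup` parameter are assumptions made for the example.

```python
import socket
import threading


class CachingResolver:
    """Thread-safe hostname -> IP cache that fetcher threads could share.

    Illustration only: there is no expiry, so entries never honor DNS TTLs.
    """

    def __init__(self, lookup=socket.gethostbyname):
        self._lookup = lookup          # real resolver by default, replaceable
        self._cache = {}
        self._lock = threading.Lock()

    def resolve(self, host):
        with self._lock:
            if host in self._cache:
                return self._cache[host]
        # Look up outside the lock so threads do not queue behind slow DNS.
        # Two threads may race on the same new host; the first result wins.
        ip = self._lookup(host)
        with self._lock:
            return self._cache.setdefault(host, ip)
```

With a cache like this, repeated fetches from the same host cost one lookup instead of one per request, which is the effect a dedicated DNS cache gives you for the whole crawl.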


I'm giving each task 2048m of memory.

This is likely too much. Not that it really hurts if you have enough RAM... but JVMs may actually be less efficient if you give them unnecessarily huge heaps.
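
To put the slot counts and heap sizes from this thread together, here is a back-of-the-envelope Python sketch. The function names, the 8192MB node size and the `reserved_mb` allowance for the OS and Hadoop daemons are assumptions for illustration, not recommendations. It also counts only the configured heap, so the real footprint of each JVM will be somewhat higher.

```python
def task_heap_budget_mb(map_slots, reduce_slots, heap_mb_per_task):
    """Total heap if every map and reduce slot on a node is busy at once."""
    return (map_slots + reduce_slots) * heap_mb_per_task


def fits_in_ram(map_slots, reduce_slots, heap_mb_per_task,
                node_ram_mb, reserved_mb=2048):
    """Check the heap budget against node RAM.

    reserved_mb is an assumed allowance for the OS and daemons.
    """
    needed = task_heap_budget_mb(map_slots, reduce_slots, heap_mb_per_task)
    return needed + reserved_mb <= node_ram_mb


if __name__ == "__main__":
    # 4 maps + 2 reduces per node ("regular hardware" in the reply above)
    print(task_heap_budget_mb(4, 2, 768))    # 4608
    print(task_heap_budget_mb(4, 2, 2048))   # 12288
    print(fits_in_ram(4, 2, 768, 8192))      # True
    print(fits_in_ram(4, 2, 2048, 8192))     # False
```

With 4 map and 2 reduce slots, 768MB per task needs about 4.5GB of heap, while 2048m per task needs 12GB, which is why the larger setting only makes sense on nodes with plenty of RAM.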

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
