On Jun 28, 2011, at 1:43 PM, Peter Wolf wrote:

> Hello all,
> 
> I am looking for the right thing to read...
> 
> I am writing a MapReduce Speech Recognition application.  I want to run many 
> Speech Recognizers in parallel.
> 
> Speech Recognizers not only use a large amount of processor, they also use a 
> large amount of memory.  Also, in my application, they are often idle much of 
> the time waiting for data.  So optimizing what runs when is non-trivial.
> 
> I am trying to better understand how Hadoop manages resources.  Does it 
> automatically figure out the right number of mappers to instantiate?

        The number of mappers corresponds to the number of InputSplits, which 
is determined by the InputFormat.  In most cases this is equivalent to the 
number of blocks, so a file composed of 3 blocks will generate 3 mappers.  
Depending upon the InputFormat, the size of these splits may be adjusted via 
job settings.
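
        For example, with the newer org.apache.hadoop.mapreduce API, something 
along these lines adjusts the split size bounds, and therefore the number of 
mappers.  This is only an untested sketch -- the exact behavior depends on your 
Hadoop version and on the InputFormat in use, and the 128/256 MB figures are 
placeholders, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "speech-recognizer");

            // For FileInputFormat subclasses (e.g. TextInputFormat), the
            // min/max split size bounds influence how many InputSplits --
            // and hence mappers -- get created.
            FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
            FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

            // ... set the Mapper class, input/output paths, then submit.
        }
    }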


>  How?  What happens when other people are sharing the cluster?  What resource 
> management is the responsibility of application developers?

        Realistically, *all* resource management is the responsibility of the 
operations and development teams.  The only real resource protection/allocation 
mechanism Hadoop provides is task slots and, if enabled, some memory protection 
in the form of "don't go over this much".  On multi-tenant systems, a 
good-neighbor view of the world should be adopted.
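
        To make that concrete, the knobs in question look roughly like the 
following.  This is a sketch only: the slot count is a cluster-side 
mapred-site.xml setting that belongs to the operations team, and the memory 
property only has teeth if the TaskTrackers have memory monitoring enabled.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MemoryLimitExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Cluster-side (normally set in mapred-site.xml, not per job):
            // the number of map slots each TaskTracker offers.
            // conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);

            // Job-side: request up to 1 GB of virtual memory per map task.
            // With memory monitoring enabled, a task that grows past this
            // is killed -- the "don't go over this much" protection.
            conf.setLong("mapred.job.map.memory.mb", 1024);

            Job job = new Job(conf, "speech-recognizer");
            // ... configure the Mapper, input/output paths, then submit.
        }
    }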

> For example, let's say each Speech Recognizer uses 500 MB, and I have 
> 1,000,000 files to process.  What would happen if I made 1,000,000 mappers, 
> each with 1 Speech Recognizer?  

        At 1M mappers, the JobTracker would likely explode under the weight 
first unless its heap size were raised significantly.  Every value you see on 
the JT page, including those for each task, is kept in main memory.  


> Is it only non-optimal because of setup time, or would the system try to 
> allocate 500GB of memory and explode?

        If you had 1M map slots, yes, the tasks would collectively try to claim 
that memory -- roughly 500 TB in aggregate (1,000,000 tasks x 500 MB each), 
spread across the nodes.
