Jeff,

Inside the MR AM there are a number of priorities given to the various task 
requests when they are handed to the RM for scheduling.  Map tasks have a 
higher priority than reduce tasks do.  That means the scheduler should never 
return a container for a reduce task until all outstanding map task requests 
have been satisfied.  That is the real problem you are seeing.  I cannot 
fully explain the reasoning behind this, but I assume it is in part because 
of the dependency ordering between the two task types.  Apparently in your 
case, having the map tasks finish faster because they have more free slots 
does not make up for the extra time it takes for the reducers to launch.  
That is a little odd, because I don't believe we have seen the same slowdown 
in our terasort benchmarks, but I will have to verify that.
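
To make the mechanism concrete, here is a rough sketch of the idea 
(illustrative values only -- the real constants live in the MR AM's 
allocator, and note that in YARN a numerically lower value means a higher 
priority):

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.util.Records;

    public class TaskPriorities {
      // Hypothetical values: maps get the numerically lower (i.e. higher)
      // priority, so the RM satisfies all map requests before any reduce.
      static final Priority MAP_PRIORITY = of(10);
      static final Priority REDUCE_PRIORITY = of(20);

      static Priority of(int p) {
        Priority priority = Records.newRecord(Priority.class);
        priority.setPriority(p);
        return priority;
      }
    }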

Perhaps what you want to look into, then, is varying the priority of the 
requests.  Be careful, though, because the request priority is slightly 
abused: it also doubles as a cookie indicating the type of request.  So just 
setting the map and reduce task priorities to the same value is also going 
to require some changes in the code that matches the assigned containers up 
with the outstanding tasks.
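
Something like this is what I mean by the cookie (a hypothetical sketch, 
not the AM's actual implementation, reusing the priorities from the sketch 
above):

    import org.apache.hadoop.yarn.api.records.Container;

    class ContainerDispatch {
      void assign(Container container) {
        int p = container.getPriority().getPriority();
        if (p == TaskPriorities.MAP_PRIORITY.getPriority()) {
          // hand the container to a pending map attempt
        } else if (p == TaskPriorities.REDUCE_PRIORITY.getPriority()) {
          // hand the container to a pending reduce attempt
        }
        // With equal priorities this dispatch can no longer tell the two
        // request types apart -- hence the extra code changes.
      }
    }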

Alternatively, you could change the block size/splits so that the total 
number of map tasks plus the total number of reduce tasks equals the total 
number of slots in your cluster.  That way, once all of the map task 
requests are satisfied, there are still slots left over for the reduce tasks 
to launch in.  Each map task would process more data than it did before, so 
it might not be as fast as the previous tasks were, but hopefully that will 
not be too bad.
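
Back-of-the-envelope, using the numbers from your example below (the 1 TB 
input size is my assumption):

    long totalSlots  = 200;
    long reduceTasks = 100;
    long targetMaps  = totalSlots - reduceTasks;    // 100 map tasks
    long inputBytes  = 1000L * 1000 * 1000 * 1000;  // assume ~1 TB of input
    long splitBytes  = inputBytes / targetMaps;     // ~10 GB per split
    // then raise the block size / mapred.min.split.size toward splitBytes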

--Bobby Evans

On 5/10/12 6:16 PM, "Jeffrey Buell" <jbu...@vmware.com> wrote:

I have slowstart set to 0.05 (I think that is the default).  In MR1 all of the 
reducers start when 5% of the maps finish (expected and desired behavior).  
This allows the shuffle phase to keep up with the maps, and it completes soon 
after all maps are finished.  In MR2 (YARN), one reducer starts at that point, 
and the others start slowly over time.  When the maps are all finished, not 
much of the shuffle work is done, hence the increased elapsed time.
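
For reference, my mental model of the slowstart knob is just a threshold 
like this (a sketch, not the AM's actual ramp-up code):

    boolean reducesEligible(int completedMaps, int totalMaps, float slowstart) {
      return completedMaps >= Math.ceil(slowstart * totalMaps);
    }
    // slowstart = 0.05 with 4000 maps -> eligible once 200 maps finish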

Will slowstart=0 behave much differently than 0.05?  But let's say 
slowstart=0 and you have 4000 map tasks, 100 reduce tasks, and 200 slots.  How 
many tasks of each kind would you expect to occupy those slots initially?  
Equal weighting to map and reduce would mean 100 of each.  Equal weighting to 
all tasks ready to run means 195 maps and 5 reduces.  That's a big difference 
in behavior.
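
Spelling out the arithmetic behind those two expectations:

    int slots = 200, maps = 4000, reduces = 100;
    int perType  = slots / 2;                        // 100 maps + 100 reduces
    int mapShare = slots * maps / (maps + reduces);  // 800000/4100 = 195 maps
    int redShare = slots - mapShare;                 // the remaining 5 reduces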

Jeff


From: Arun C Murthy [mailto:a...@hortonworks.com]
Sent: Thursday, May 10, 2012 3:50 PM
To: mapreduce-user@hadoop.apache.org
Subject: Terasort

Changing subject...

On May 10, 2012, at 3:40 PM, Jeffrey Buell wrote:


I have the right #slots to fill up memory across the cluster, and all those 
slots are filled with tasks. The problem I ran into was that the maps grabbed 
all the slots initially and the reduces had a hard time getting started.  As 
maps finished, more maps were started and only rarely was a reduce started.  I 
assume this behavior occurred because I had ~4000 map tasks in the queue, but 
only ~100 reduce tasks.  If the scheduler lumps maps and reduces together, then 
whenever a slot opens up it will almost surely be taken by a map task.  To get 
good performance I need all reduce tasks started early on, and to have only map 
tasks compete for open slots.  Other apps may need different priorities between 
maps and reduces.  In any case, I don't understand how treating maps and 
reduces the same is workable.

Are you playing with YARN or MR1?

IAC, you are getting hit by 'slowstart' for reduces, wherein reduces aren't 
scheduled until a sufficient % of maps have completed.

Set mapred.reduce.slowstart.completed.maps to 0. (That should work for either 
MR1 or MR2).
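
For example, from the job client (the same key also works as a -D option or 
in mapred-site.xml):

    JobConf conf = new JobConf();  // org.apache.hadoop.mapred.JobConf
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.0f);
    // MR2 also accepts the newer name:
    //   mapreduce.job.reduce.slowstart.completedmaps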

Arun

Jeff

From: Arun C Murthy [mailto:a...@hortonworks.com]
Sent: Thursday, May 10, 2012 1:27 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: max 1 mapper per node

For terasort you want to fill up your entire cluster with maps/reduces as fast 
as you can to get the best performance.

Just play with #slots.
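
In MR1 terms that means the per-TaskTracker maximums, normally set in 
mapred-site.xml on each node (the values here are only examples; tune them 
to your hardware):

    // org.apache.hadoop.conf.Configuration, read by each TaskTracker at startup
    Configuration ttConf = new Configuration();
    ttConf.setInt("mapred.tasktracker.map.tasks.maximum", 8);     // map slots/node
    ttConf.setInt("mapred.tasktracker.reduce.tasks.maximum", 4);  // reduce slots/node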

Arun

On May 9, 2012, at 12:36 PM, Jeffrey Buell wrote:

Not to speak for Radim, but what I'm trying to achieve is performance at least 
as good as 0.20 for all cases.  That is, no regressions.  For something as 
simple as terasort, I don't think that is possible without being able to 
specify the max number of mappers/reducers per node.  As it is, I see slowdowns 
of as much as 2X.  Hopefully I'm wrong and somebody will straighten me out.  But 
if I'm not, adding such a feature won't lead to bad behavior of any kind since 
the default could be set to unlimited and thus have no effect whatsoever.

I should emphasize that I support the goal of greater automation since Hadoop 
has way too many parameters and is so hard to tune.  Just not at the expense of 
performance regressions.

Jeff

We've been against these 'features' since they lead to very bad behaviour 
across the cluster with multiple apps/users etc.

What is your use-case, i.e., what are you trying to achieve with this?

thanks,

Arun

On May 3, 2012, at 5:59 AM, Radim Kolar wrote:

if a plugin system for the AM is overkill, something simpler could be done, 
like:

maximum number of mappers per node
maximum number of reducers per node

maximum percentage of non-data-local tasks
maximum percentage of rack-local tasks

and set these in job properties.
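
Purely hypothetical, but something along these lines (conf being the job's 
Configuration) -- none of these property names exist today, they just 
illustrate the shape of the proposal:

    conf.setInt("mapreduce.job.max.maps.per.node", 1);          // hypothetical
    conf.setInt("mapreduce.job.max.reduces.per.node", 1);       // hypothetical
    conf.setFloat("mapreduce.job.max.nonlocal.task.pct", 10f);  // hypothetical
    conf.setFloat("mapreduce.job.max.racklocal.task.pct", 25f); // hypothetical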

--

Arun C. Murthy

Hortonworks Inc.
http://hortonworks.com/
