I'm working on a cluster with 360 reducer slots. I've got a big job, so when I launch it I follow the recommendations in the Hadoop documentation and set mapred.reduce.tasks=350, i.e. slightly fewer than the available number of slots.
The problem is that my reducers can still take a long time (2-4 hours) to run, so I end up grabbing a big slab of reducer slots and starving everybody else out. I've set my job priority to VERY_LOW and mapred.reduce.slowstart.completed.maps to 0.9, so I think I've done everything I can on the job-parameters front (there's a configuration sketch at the end of this post). Currently there isn't a way to make the individual reducers run faster, so I'm trying to figure out the best way to run my job so that it plays nicely with other users of the cluster. My rule of thumb has always been not to do any scheduling myself but to let Hadoop handle it for me, but I don't think that works in this scenario.

Questions:

1. Am I correct in thinking that long reducer times degrade Hadoop's scheduling granularity to a degree it can't handle? Are 4-hour reducers outside the normal operating range of Hadoop?

2. Is there any way to stagger task launches (aside from doing it manually)?

3. What if I set mapred.reduce.tasks to some value much, much larger than the number of available reducer slots, like 100,000? Will that make the amount of work sent to each reducer smaller (hence increasing the scheduler granularity), or will it have no effect?

4. In this scenario, do I just have to reconcile myself to the fact that my job is going to squat on a block of reducers no matter what, and set mapred.reduce.tasks to something much smaller than the available number of slots?
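For reference, here's roughly how the job is configured; a minimal sketch using the old mapred API, where the driver class, job name, and the mapper/reducer setup (omitted here) are placeholders, not my actual job:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.*;

    public class MyBigJob {  // placeholder driver; mapper/reducer setup omitted
      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(MyBigJob.class);
        conf.setJobName("my-big-job");              // placeholder name

        conf.setNumReduceTasks(350);                // mapred.reduce.tasks: just under the 360 slots
        conf.setJobPriority(JobPriority.VERY_LOW);  // mapred.job.priority
        // Hold reducers back until 90% of the maps have completed.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.9f);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }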
