W.P.,

Hard to help out without knowing more about the characteristics of your data. How many keys are you expecting? How many values per key?
Cheers,
James.

On 2011-05-18, at 3:25 PM, W.P. McNeill wrote:

> I'm working on a cluster with 360 reducer slots. I've got a big job, so
> when I launch it I follow the recommendations in the Hadoop documentation
> and set mapred.reduce.tasks=350, i.e. slightly less than the available
> number of slots.
>
> The problem is that my reducers can still take a long time (2-4 hours) to
> run. So I end up grabbing a big slab of reducers and starving everybody
> else out. I've got my priority set to VERY_LOW and
> mapred.reduce.slowstart.completed.maps to 0.9, so I think I've done
> everything I can do on the job parameters front. Currently there isn't a
> way to make the individual reducers run faster, so I'm trying to figure out
> the best way to run my job so that it plays nice with other users of the
> cluster. My rule of thumb has always been to not do any scheduling
> myself but let Hadoop handle it for me; however, I don't think that works
> in this scenario.
>
> Questions:
>
> 1. Am I correct in thinking that long reducer times just mess up Hadoop's
> scheduling granularity to a degree that it can't handle? Is a 4-hour
> reducer outside the normal operating range of Hadoop?
> 2. Is there any way to stagger task launches? (Aside from manually.)
> 3. What if I set mapred.reduce.tasks to be some value much, much larger
> than the number of available reducer slots, like 100,000. Will that make
> the amount of work sent to each reducer smaller (hence increasing the
> scheduler granularity) or will it have no effect?
> 4. In this scenario, do I just have to reconcile myself to the fact that
> my job is going to squat on a block of reducers no matter what, and set
> mapred.reduce.tasks to something much less than the available number of
> slots?
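For reference, here is a minimal sketch of setting the knobs mentioned above through the old-style JobConf API. Only the three properties (mapred.reduce.tasks via setNumReduceTasks, mapred.reduce.slowstart.completed.maps, and the VERY_LOW priority) come from the thread; the driver class name and job name are made-up placeholders.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobPriority;

    public class BigReduceDriver {
        public static void main(String[] args) throws Exception {
            // Hypothetical driver; only the tuning calls below reflect the thread.
            JobConf conf = new JobConf(BigReduceDriver.class);
            conf.setJobName("big-reduce-job");

            // 350 reduce tasks, slightly fewer than the 360 slots on the cluster.
            conf.setNumReduceTasks(350);

            // Hold reducers back until 90% of the map tasks have completed.
            conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.9f);

            // Run at the lowest priority so other jobs are favoured by the scheduler.
            conf.setJobPriority(JobPriority.VERY_LOW);

            // Mapper, reducer, and input/output paths would be configured here,
            // then the job submitted with JobClient.runJob(conf).
        }
    }

If the driver goes through ToolRunner instead, the same properties can be passed on the command line, e.g. -D mapred.reduce.tasks=350 -D mapred.reduce.slowstart.completed.maps=0.9, without recompiling.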
