Hi,

There are two purposes:
1. Load-balance both map and reduce tasks to solve the skew problem.
2. By controlling the scale of a task, a task can be regarded as a timeslice of scheduling.
The first is the precondition of the second. How can the skew problem be solved? I will describe it in detail shortly. I believe it is feasible.

2010/2/4 Jeff Zhang <[email protected]>

> Hi,
>
> Do you mean re-splitting and recombining in each map task? I am not sure
> of the purpose; as I understand it, the Partitioner determines which
> reducer the output of a map task goes to, so I don't think your method
> can solve the skew problem.
>
> 2010/2/4 易剑 <[email protected]>
>
> > Currently, only map tasks are balanced; reduce tasks may be skewed, and
> > their timeslices differ, which prevents the scheduler from being smart.
> > I have an idea to improve this.
> >
> > We can break the output of the map phase into N*M splits, where N is
> > the number of nodes and M >= 1, and regroup them into new splits by
> > combining the smaller splits and re-splitting the bigger splits, until
> > the size of every split is balanced against a specified value.
> >
> > There are three cases:
> > 1. Too many values for a key
> > 2. Too many keys hash to a partition
> > 3. Every partition is balanced in size
> >
> > If there are too many values for a key, adding a new MapReduce
> > procedure is necessary. If too many keys hash to a partition,
> > re-splitting is necessary.
> >
> > If every split is balanced, we can treat a task (map or reduce) as a
> > scheduler timeslice, and the scheduler will be as smart as an OS
> > scheduler.
>
> --
> Best Regards
>
> Jeff Zhang

--
Hadoop Technology Forum
http://bbs.hadoopor.com
http://www.hadoopor.com
http://forum.hadoopor.com
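The combine/re-split idea in the quoted proposal can be sketched roughly as below. This is a minimal illustration, not Hadoop code: it models each map-output split only by its size, and the function name `rebalance` and the `target_size` parameter are hypothetical.

```python
# Hypothetical sketch of the proposal: cut splits larger than the
# target into target-sized chunks (re-splitting), and pool splits
# smaller than the target together (combining), so every resulting
# split is at most target_size and the total size is preserved.

def rebalance(split_sizes, target_size):
    balanced = []
    buffer = 0  # accumulates small splits until they reach the target
    for size in split_sizes:
        # Re-split: a split bigger than the target is cut into chunks.
        while size > target_size:
            balanced.append(target_size)
            size -= target_size
        # Combine: pool the remainder with other small splits.
        buffer += size
        if buffer >= target_size:
            balanced.append(target_size)
            buffer -= target_size
    if buffer:
        balanced.append(buffer)
    return balanced
```

For example, `rebalance([5, 1, 1, 9], 4)` yields four splits of size 4, so every task covers the same amount of data and can serve as a uniform scheduling timeslice. Note this only addresses case 2 (too many keys per partition); case 1 (too many values for one key) cannot be fixed by re-splitting alone, which is why the proposal adds an extra MapReduce pass for it.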
