Hi,

Do you mean to do the resplitting and recombining in each mapper task? I am not sure what the purpose would be: as I understand it, the Partitioner determines which reducer the output of a mapper task goes to. So I don't think your method can solve the skew problem.
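For context, the partitioning step referred to above is a single hash over the key, independent of how the map output is laid out. The sketch below is roughly what Hadoop's default HashPartitioner does; treat it as an illustration rather than the exact shipped source:

    import org.apache.hadoop.mapreduce.Partitioner;

    // Picks a reducer from the key's hash alone; the value and the
    // layout of the map output play no part in the choice.
    public class HashPartitioner<K, V> extends Partitioner<K, V> {
      @Override
      public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }

Because the reducer is chosen purely from key.hashCode(), regrouping map output on the mapper side does not change which reducer a key lands on, so a hot key or a hot hash bucket stays skewed.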
2010/2/4 易剑 <[email protected]>

> Currently only map tasks are balanced; reduce tasks may be skewed, and
> their timeslices differ as well, which keeps the scheduler from being
> smart. I have an idea to improve this.
>
> We can break the output of a map into N*M splits, where N is the number
> of nodes and M >= 1, and regroup them into new splits by combining the
> smaller splits and resplitting the bigger splits, until the size of
> every split is balanced against a specified target value.
>
> There are three cases:
> 1. Too many values for a key
> 2. Too many keys hash to a partition
> 3. Every partition is balanced in size
>
> If there are too many values for a key, adding a new MapReduce procedure
> is necessary.
> If too many keys hash to a partition, resplitting is necessary.
>
> If every split is balanced, we can treat a task (map or reduce) as a
> scheduler timeslice, and the scheduler will be smart like an OS
> scheduler.

--
Best Regards
Jeff Zhang
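To make the quoted proposal concrete, here is a minimal sketch of its combine/resplit pass. Everything in it is hypothetical: Split, rebalance, and targetSize are illustrative names, not Hadoop APIs. It covers cases 2 and 3 only; case 1 (too many values for a single key) cannot be fixed by regrouping and needs the extra MapReduce pass the proposal mentions.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    // Hypothetical stand-in for a chunk of map output; only its size
    // matters to the balancing logic sketched here.
    class Split {
        final long size;
        Split(long size) { this.size = size; }
    }

    public class SplitRebalancer {

        // Regroup splits so no resulting split exceeds targetSize:
        // oversized splits are cut into target-sized pieces (case 2),
        // undersized splits are greedily packed together (case 3).
        static List<Split> rebalance(List<Split> splits, long targetSize) {
            List<Split> result = new ArrayList<>();
            PriorityQueue<Split> small =
                new PriorityQueue<>(Comparator.comparingLong((Split s) -> s.size));

            for (Split s : splits) {
                long remaining = s.size;
                // Resplit anything bigger than the target.
                while (remaining > targetSize) {
                    result.add(new Split(targetSize));
                    remaining -= targetSize;
                }
                if (remaining > 0) {
                    small.add(new Split(remaining));
                }
            }

            // Combine the leftovers, smallest first, so each group
            // approaches the target without ever exceeding it.
            long acc = 0;
            while (!small.isEmpty()) {
                Split s = small.poll();
                if (acc > 0 && acc + s.size > targetSize) {
                    result.add(new Split(acc));
                    acc = 0;
                }
                acc += s.size;
            }
            if (acc > 0) {
                result.add(new Split(acc));
            }
            return result;
        }

        public static void main(String[] args) {
            List<Split> input = Arrays.asList(
                new Split(300), new Split(40), new Split(70), new Split(120));
            for (Split s : rebalance(input, 100)) {
                System.out.println("split of size " + s.size);
            }
        }
    }

This is a naive greedy pass: every output split is at most targetSize, though packed groups may land somewhat under it. Note that it balances sizes only; as the reply above points out, the Partitioner still decides which reducer each key reaches, so this regrouping alone does not address reducer-side skew.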
