Currently, only map tasks are load-balanced; reduce tasks can still be skewed, and their timeslices differ widely, which keeps the scheduler from being smart. I have an idea to improve it.
We can break the map output into N*M splits, where N is the number of nodes and M >= 1, and then regroup them into new splits by combining the smaller splits and resplitting the bigger ones, until every split's size is balanced around a specified target value. There are three cases:

1. Too many values for a single key.
2. Too many keys hash to one partition.
3. Every partition is balanced in size.

If a single key has too many values, adding a new MapReduce pass is necessary. If too many keys hash to one partition, resplitting is necessary. If every split is balanced, we can treat each task (map or reduce) as one scheduler timeslice, and the scheduler becomes smart like an OS scheduler.
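To make the regrouping step concrete, here is a minimal sketch of the combine/resplit idea in Python. It is purely illustrative: `balance_splits`, its parameters, and the greedy grouping strategy are my assumptions, not an existing Hadoop API, and it balances by size only (it does not handle the single-hot-key case, which needs the extra MapReduce pass described above).

```python
def balance_splits(split_sizes, target, tolerance=0.5):
    """Regroup split sizes so each group's total is near `target`.

    split_sizes: byte counts of the N*M initial map-output splits.
    target: desired size per balanced split.
    tolerance: accept groups within target*(1 +/- tolerance).
    (Hypothetical sketch; names and strategy are illustrative.)
    """
    low, high = target * (1 - tolerance), target * (1 + tolerance)

    # Resplit: cut any oversized split into target-sized pieces.
    pieces = []
    for s in split_sizes:
        while s > high:
            pieces.append(target)
            s -= target
        if s > 0:
            pieces.append(s)

    # Combine: greedily group pieces (largest first) until each
    # group reaches the lower size bound.
    groups, current = [], []
    for s in sorted(pieces, reverse=True):
        current.append(s)
        if sum(current) >= low:
            groups.append(current)
            current = []
    if current:  # leftover small pieces form one final group
        groups.append(current)
    return groups
```

For example, `balance_splits([100, 10, 10, 10, 50], 60)` resplits the 100-unit split and merges the three 10-unit splits, yielding groups whose totals all fall in the 30..90 range, so every reduce task would get a roughly equal timeslice.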
