On 07/20/2012 09:20 AM, Dave Shine wrote:
I have a job that is emitting over 3 billion rows from the map to the reduce phase.
The job is configured with 43 reduce tasks, so a perfectly even distribution
would amount to about 70 million rows per reduce task. However, most of the tasks
actually got around 60 million rows, one task got over 100 million, and one
task got almost 350 million. This uneven distribution caused the job to run
for an exceedingly long time.
I believe this is referred to as a "key skew problem", which I know is heavily
dependent on the actual data being processed. Can anyone point me to any blog posts,
white papers, etc. that might give me some options on how to deal with this issue?
Hadoop lets you replace the default HashPartitioner with your own Partitioner
implementation, so you can write a custom partitioning scheme that spreads
your data across the reducers more evenly.
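
For example, here is a rough sketch of a custom Partitioner (written against the
newer org.apache.hadoop.mapreduce API) that pins a couple of known hot keys to
their own reducers and hashes everything else the way the default HashPartitioner
does. The key names and class name here are just placeholders for whatever your
data shows is skewed, not something specific to your job:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Pin known hot keys to dedicated reducers so they don't pile up
        // on top of other keys. "HOT_KEY_A"/"HOT_KEY_B" are placeholders.
        String k = key.toString();
        if (k.equals("HOT_KEY_A")) {
            return 0;
        }
        if (k.equals("HOT_KEY_B")) {
            return 1;
        }
        // Everything else hashes over the remaining reducers, the same way
        // the default HashPartitioner does. Assumes numPartitions > 2,
        // which holds for a job with 43 reducers.
        return 2 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 2);
    }
}

You would then register it on the job with
job.setPartitionerClass(SkewAwarePartitioner.class).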
HTH,
DR