Good to know. Thanks for the update. - Tim.
On Jul 25, 2012, at 5:21 AM, "Dave Shine" <dave.sh...@channelintelligence.com> wrote:

> Just wanted to follow up on this issue. It turned out that I was overlooking
> the obvious. It turns out that over 8% of the mapper output had exactly the
> same key, which was actually an invalid value. By changing the mapper to not
> emit records with an invalid key, the problem went away.
>
> Moral of the story: verify the data before you blame the software.
>
> Dave Shine
> Sr. Software Engineer
> 321.939.5093 direct | 407.314.0122 mobile
> CI Boost(tm) Clients Outperform Online(tm) www.ciboost.com
>
>
> -----Original Message-----
> From: Dave Shine [mailto:dave.sh...@channelintelligence.com]
> Sent: Friday, July 20, 2012 1:13 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: RE: Distributing Keys across Reducers
>
> Yes, that is a possibility, but it will take some significant rearchitecting.
> I was assuming that was what I was going to have to do, until I saw the key
> distribution problem and thought I might be able to buy some relief by
> addressing that.
>
> The job runs once per day, starting at 1:00 AM EDT. I have changed it to use
> fewer reducers just to see how that affects the distribution.
>
> Dave Shine
> Sr. Software Engineer
> 321.939.5093 direct | 407.314.0122 mobile
> CI Boost(tm) Clients Outperform Online(tm) www.ciboost.com
>
>
> -----Original Message-----
> From: Tim Broberg [mailto:tim.brob...@exar.com]
> Sent: Friday, July 20, 2012 1:03 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: RE: Distributing Keys across Reducers
>
> Just a thought, but can you deal with the problem with increased granularity
> by simply making the jobs smaller?
>
> If you have enough jobs, then when one takes twice as long there will be
> plenty of other small jobs to keep the other nodes busy, right?
>
> - Tim.
>
> ________________________________________
> From: David Rosenstrauch [dar...@darose.net]
> Sent: Friday, July 20, 2012 7:45 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Distributing Keys across Reducers
>
> On 07/20/2012 09:20 AM, Dave Shine wrote:
>> I have a job that is emitting over 3 billion rows from the map to the
>> reduce. The job is configured with 43 reduce tasks. A perfectly even
>> distribution would amount to about 70 million rows per reduce task.
>> However, I actually got around 60 million for most of the tasks, one task
>> got over 100 million, and one task got almost 350 million. This uneven
>> distribution caused the job to run exceedingly long.
>>
>> I believe this is referred to as a "key skew problem", which I know is
>> heavily dependent on the actual data being processed. Can anyone point me
>> to any blog posts, white papers, etc. that might give me some options on
>> how to deal with this issue?
>
> Hadoop lets you override the default partitioner and replace it with your
> own. This lets you write a custom partitioning scheme which distributes your
> data more evenly.
>
> HTH,
>
> DR
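
For anyone reading this thread in the archives: a minimal sketch of the kind of
custom partitioner David describes might look like the one below. The class
name, the HOT_KEY value, and the idea of parking a single known-heavy key on
its own reducer are illustrative assumptions, not code from Dave's job.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative sketch: send one known-heavy key to a dedicated reducer and
    // hash every other key across the remaining reducers.
    public class SkewAwarePartitioner extends Partitioner<Text, Text> {

        private static final String HOT_KEY = "the-heavy-key";  // hypothetical

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            if (numPartitions == 1) {
                return 0;
            }
            if (HOT_KEY.equals(key.toString())) {
                // Reserve the last reducer for the single skewed key.
                return numPartitions - 1;
            }
            // Spread all other keys over the remaining reducers.
            return (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
        }
    }

It would be registered with job.setPartitionerClass(SkewAwarePartitioner.class).
This only helps when the heavy key is legitimate data; here the heavy key turned
out to be invalid, so filtering it in the mapper (see the sketch below) was the
simpler fix.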
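
And a rough sketch of the fix Dave describes at the top of the thread: have the
mapper drop records whose key is invalid instead of emitting them all under one
bogus key. The field layout, the notion of what counts as "invalid", and the
counter name are all hypothetical; only the drop-instead-of-emit pattern is the
point.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative sketch: skip records with a missing or invalid key instead
    // of emitting them all under one bogus key (which is what caused the skew).
    public class FilteringMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Hypothetical record layout: tab-separated, key in the first field.
            String[] fields = line.toString().split("\t", 2);
            String key = fields[0];

            // Hypothetical notion of "invalid" -- empty or a placeholder null.
            if (key.isEmpty() || "\\N".equals(key)) {
                context.getCounter("Filter", "InvalidKeyDropped").increment(1);
                return;  // drop the record; keep the bad key out of the shuffle
            }

            context.write(new Text(key),
                          new Text(fields.length > 1 ? fields[1] : ""));
        }
    }

Counting the dropped records rather than silently discarding them makes it easy
to confirm from the job counters how much of the input was affected.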