As I said in the original message, bad partitioning was my original theory. I have had issues with it in the past and am careful with my partitioner. It was the first thing I looked for, but I do not see any evidence that the slower tasks have significantly more data than the faster ones, and certainly not enough to justify a radically different running time.
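To make that check concrete: one way to test for skew is to compare per-task reduce input record counts pulled from the job counters. A minimal sketch, plain Java with made-up numbers (the counts below are hypothetical stand-ins, not values from the actual job):

```java
// Sketch: quantify reducer skew from per-task input-record counts.
// The counts are hypothetical stand-ins for values read from the job's
// task counters (e.g. REDUCE_INPUT_RECORDS per attempt).
public class SkewCheck {
    static double maxOverMean(long[] recordsPerReducer) {
        long max = 0, sum = 0;
        for (long r : recordsPerReducer) {
            max = Math.max(max, r);
            sum += r;
        }
        double mean = (double) sum / recordsPerReducer.length;
        return max / mean;   // ~1.0 means even; much greater than 1 means one reducer is overloaded
    }

    public static void main(String[] args) {
        long[] counts = {980_000, 1_020_000, 1_010_000, 995_000};  // hypothetical
        System.out.printf("max/mean = %.2f%n", maxOverMean(counts));
        // A ratio near 1 (as here) says the slow tasks are NOT explained by data volume.
    }
}
```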
On Thu, Aug 29, 2013 at 9:29 AM, Charles Baker <cba...@sdl.com> wrote:
> Hi Steve. Sounds like a classic case of uneven data distribution among the
> reducers. Most of your data is probably going to those 10 reducers that are
> taking many hours. You may want to adjust your key and/or partitioning
> strategy to better distribute the data amongst the reducers. If you're
> using a hashing type of partitioning strategy, think about using a prime
> number of reducers. Primes are proven to have a more even distribution with
> a hash type strategy and this alone may get you pretty far. I have no idea
> what your workflow or cluster configuration is like, but 300 reducers for
> 300 mappers doesn't sound right. Try using a (prime) number of reducers
> that's roughly equal to 95% of the total reducer slots allocated on the
> cluster and go from there. Usually, the cluster should be configured for
> fewer reducers than mappers. If you have 12 cores per node (HT off), try 8
> mappers and 3 reducers per node.
>
> Good luck!
>
> Chuck
>
> *From:* Steve Lewis [mailto:lordjoe2...@gmail.com]
> *Sent:* Wednesday, August 28, 2013 7:48 PM
> *To:* mapreduce-user
> *Subject:* Some jobs seem to run forever
>
> I am running a Hadoop job on a 40-node cluster with about 300 map tasks
> and about 300 reduce tasks. Most tasks complete within 20 minutes, but a
> few, typically fewer than 10, run for many hours.
> If they complete, I see nothing to suggest that the number of bytes read
> or written, or the number of records read or written, is significantly
> different from tasks that run much faster. I sometimes see multiple
> attempts (usually only two), and the cluster is doing nothing else.
>
> Any suggested tuning?
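For reference, the hash-style partitioning Chuck mentions is essentially what Hadoop's default HashPartitioner does: mask the sign bit off key.hashCode() and take it modulo the reducer count. A minimal sketch in plain Java (no Hadoop dependency; class and method names here are illustrative) showing why a reducer count sharing a factor with the key hashes can leave most partitions empty:

```java
// Sketch of hash-style partitioning in the spirit of Hadoop's default
// HashPartitioner. Plain Java; names are illustrative, not Hadoop's API.
public class HashPartitionSketch {
    // Mask off the sign bit so the result is non-negative, then mod.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Integer keys in steps of 10 stand in for keys whose hash codes
        // share a common factor with the reducer count.
        int composite = 300, prime = 293;
        java.util.Set<Integer> usedComposite = new java.util.TreeSet<>();
        java.util.Set<Integer> usedPrime = new java.util.TreeSet<>();
        for (int k = 0; k < 3000; k += 10) {
            usedComposite.add(getPartition(k, composite));
            usedPrime.add(getPartition(k, prime));
        }
        // With 300 reducers these 300 keys hit only 30 partitions;
        // with a prime count (293) the same keys spread across far more.
        System.out.println("partitions used with 300 reducers: " + usedComposite.size());
        System.out.println("partitions used with 293 reducers: " + usedPrime.size());
    }
}
```

This illustrates the mechanism behind the prime-count advice: a prime reducer count is coprime to any stride in the key hashes, so structured hash codes still spread across all partitions.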
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com