If you google for such memory failures, you'll find the MapReduce tunable
that'll help you: mapred.job.shuffle.input.buffer.percent. It is well known
that the default values in the Hadoop config don't work well for large data
systems.

-Rahul

On Wed, Feb 16, 2011 at 10:36 AM, James Seigel <[email protected]> wrote:

> Good luck.
>
> Let me know how it goes.
>
> James
>
> Sent from my mobile. Please excuse the typos.
>
> On 2011-02-16, at 11:11 AM, Kelly Burkhart <[email protected]> wrote:
>
> > OK, the job was preferring the config file on my local machine which
> > is not part of the cluster over the cluster config files. That seems
> > completely broken to me; my config was basically empty other than
> > containing the location of the cluster and my job apparently used
> > defaults rather than the cluster config. It doesn't make sense to me
> > to keep configuration files synchronized on every machine that may
> > access the cluster.
> >
> > I'm running again; we'll see if it completes this time.
> >
> > -K
> >
> > On Wed, Feb 16, 2011 at 10:30 AM, James Seigel <[email protected]> wrote:
> >> Hrmmm. Well as you've pointed out. 200m is quite small and is probably
> >> the cause.
> >>
> >> Now there might be some overriding settings in something you are using
> >> to launch or something.
> >>
> >> You could set those values in the config to not be overridden in the
> >> main conf then see what tries to override it in the logs
> >>
> >> Cheers
> >> James
> >>
> >> Sent from my mobile. Please excuse the typos.
> >>
> >> On 2011-02-16, at 9:21 AM, Kelly Burkhart <[email protected]> wrote:
> >>
> >>> I should have mentioned this in my last email: I thought of that so I
> >>> logged into every machine in the cluster; each machine's
> >>> mapred-site.xml has the same md5sum.
> >>>
> >>> On Wed, Feb 16, 2011 at 10:15 AM, James Seigel <[email protected]> wrote:
> >>>> He might not have that conf distributed out to each machine
> >>>>
> >>>>
> >>>> Sent from my mobile. Please excuse the typos.
> >>>>
> >>>> On 2011-02-16, at 9:10 AM, Kelly Burkhart <[email protected]> wrote:
> >>>>
> >>>>> Our cluster admin (who's out of town today) has mapred.child.java.opts
> >>>>> set to -Xmx1280 in mapred-site.xml. However, if I go to the job
> >>>>> configuration page for a job I'm running right now, it claims this
> >>>>> option is set to -Xmx200m. There are other settings in
> >>>>> mapred-site.xml that are different too. Why would map/reduce jobs not
> >>>>> respect the mapred-site.xml file?
> >>>>>
> >>>>> -K
> >>>>>
> >>>>> On Wed, Feb 16, 2011 at 9:43 AM, Jim Falgout <[email protected]> wrote:
> >>>>>> You can set the amount of memory used by the reducer using the
> >>>>>> mapreduce.reduce.java.opts property. Set it in mapred-site.xml or
> >>>>>> override it in your job. You can set it to something like: -Xmx512M
> >>>>>> to increase the amount of memory used by the JVM spawned for the
> >>>>>> reducer task.
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Kelly Burkhart [mailto:[email protected]]
> >>>>>> Sent: Wednesday, February 16, 2011 9:12 AM
> >>>>>> To: [email protected]
> >>>>>> Subject: Re: Reduce java.lang.OutOfMemoryError
> >>>>>>
> >>>>>> I have had it fail with a single reducer and with 100 reducers.
> >>>>>> Ultimately it needs to be funneled to a single reducer though.
> >>>>>>
> >>>>>> -K
> >>>>>>
> >>>>>> On Wed, Feb 16, 2011 at 9:02 AM, real great..
> >>>>>> <[email protected]> wrote:
> >>>>>>> Hi,
> >>>>>>> How many reducers are you using currently?
> >>>>>>> Try increasing the number of reducers.
> >>>>>>> Let me know if it helps.
> >>>>>>>
> >>>>>>> On Wed, Feb 16, 2011 at 8:30 PM, Kelly Burkhart <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hello, I'm seeing frequent fails in reduce jobs with errors similar
> >>>>>>>> to this:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 2011-02-15 15:21:10,163 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_201102081823_0175_m_002153_0, compressed len: 172492, decompressed len: 172488
> >>>>>>>> 2011-02-15 15:21:10,163 FATAL org.apache.hadoop.mapred.TaskRunner: attempt_201102081823_0175_r_000034_0 : Map output copy failure : java.lang.OutOfMemoryError: Java heap space
> >>>>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >>>>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >>>>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>>>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> >>>>>>>>
> >>>>>>>> 2011-02-15 15:21:10,163 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 172488 bytes (172492 raw bytes) into RAM from attempt_201102081823_0175_m_002153_0
> >>>>>>>> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_201102081823_0175_m_002118_0, compressed len: 161944, decompressed len: 161940
> >>>>>>>> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_201102081823_0175_m_001704_0, compressed len: 228365, decompressed len: 228361
> >>>>>>>> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201102081823_0175_r_000034_0: Failed fetch #1 from attempt_201102081823_0175_m_002153_0
> >>>>>>>> 2011-02-15 15:21:10,424 FATAL org.apache.hadoop.mapred.TaskRunner: attempt_201102081823_0175_r_000034_0 : Map output copy failure : java.lang.OutOfMemoryError: Java heap space
> >>>>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
> >>>>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
> >>>>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>>>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> >>>>>>>>
> >>>>>>>> Some also show this:
> >>>>>>>>
> >>>>>>>> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >>>>>>>>         at sun.net.www.http.ChunkedInputStream.<init>(ChunkedInputStream.java:63)
> >>>>>>>>         at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:811)
> >>>>>>>>         at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
> >>>>>>>>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1072)
> >>>>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1447)
> >>>>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1349)
> >>>>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
> >>>>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> >>>>>>>>
> >>>>>>>> The particular job I'm running is an attempt to merge multiple time
> >>>>>>>> series files into a single file.
> >>>>>>>> The job tracker shows the following:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Kind     Num Tasks   Complete   Killed   Failed/Killed Task Attempts
> >>>>>>>> map      15795       15795      0        0 / 29
> >>>>>>>> reduce   100         30         70       17 / 29
> >>>>>>>>
> >>>>>>>> All of the files I'm reading have records with a timestamp key similar to:
> >>>>>>>>
> >>>>>>>> 2011-01-03 08:30:00.457000<tab><record>
> >>>>>>>>
> >>>>>>>> My map job is a simple python program that ignores rows with times <
> >>>>>>>> 08:30:00 and > 15:00:00, determines the type of input row and writes
> >>>>>>>> it to stdout with very minor modification. It maintains no state and
> >>>>>>>> should not use any significant memory. My reducer is the
> >>>>>>>> IdentityReducer. The input files are individually gzipped then put
> >>>>>>>> into hdfs. The total uncompressed size of the output should be
> >>>>>>>> around 150G. Our cluster is 32 nodes each of which has 16G RAM and
> >>>>>>>> most of which have two 2T drives. We're running hadoop 0.20.2.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Can anyone provide some insight on how we can eliminate this issue?
> >>>>>>>> I'm certain this email does not provide enough info, please let me
> >>>>>>>> know what further information is needed to troubleshoot.
> >>>>>>>>
> >>>>>>>> Thanks in advance,
> >>>>>>>>
> >>>>>>>> -Kelly
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Regards,
> >>>>>>> R.V.
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>
>
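
For anyone hitting the same thing, here is a rough Python sketch of the check discussed in this thread: read a Hadoop-style config file, see which value of mapred.child.java.opts it carries and whether it is marked final (the "not to be overridden" suggestion above), and estimate how much heap the in-memory shuffle may use. The function names and the sample XML are made up for illustration. The 0.70 default for mapred.job.shuffle.input.buffer.percent is my assumption about Hadoop 0.20.x, and the -Xmx200m fallback is the value the job configuration page reported above; check both against your own mapred-default.xml.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: default shuffle buffer fraction in Hadoop 0.20.x.
DEFAULT_SHUFFLE_PERCENT = "0.70"
# The heap setting the job actually ran with, per the thread.
DEFAULT_CHILD_OPTS = "-Xmx200m"

# Hypothetical mapred-site.xml content, for illustration only.
SAMPLE_SITE_XML = """
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1280m</value>
    <final>true</final>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return {name: (value, is_final)} from Hadoop-style configuration XML."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        is_final = (prop.findtext("final") or "").strip().lower() == "true"
        if name:
            props[name] = (value, is_final)
    return props


def heap_bytes(java_opts):
    """Parse the -Xmx size out of a JVM option string; None if it is absent."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", java_opts)
    if match is None:
        return None
    scale = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    return int(match.group(1)) * scale[match.group(2).lower()]


def shuffle_buffer_bytes(props):
    """Estimate the heap available for in-memory shuffle, in bytes."""
    opts = props.get("mapred.child.java.opts", (DEFAULT_CHILD_OPTS, False))[0]
    percent = float(props.get("mapred.job.shuffle.input.buffer.percent",
                              (DEFAULT_SHUFFLE_PERCENT, False))[0])
    heap = heap_bytes(opts)
    return None if heap is None else int(heap * percent)


if __name__ == "__main__":
    for label, props in (("cluster config", read_properties(SAMPLE_SITE_XML)),
                         ("defaults only", {})):
        mb = shuffle_buffer_bytes(props) / 1024 ** 2
        print("%s: about %.0f MB for in-memory shuffle" % (label, mb))
```

Running it against both the client-side and the cluster-side mapred-site.xml should show quickly whether the job is picking up the small default heap rather than the cluster's setting.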
