I should have mentioned this in my last email: I thought of that too, so I logged into every machine in the cluster; each machine's mapred-site.xml has the same md5sum.
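A minimal sketch of that consistency check in Python, for anyone wanting to automate it instead of logging in by hand. The digest-collection step (e.g. running `md5sum` on each host over ssh) is assumed to have happened already; `md5_of`, `find_mismatches`, and the host names are illustrative, not part of any Hadoop tooling:

```python
import hashlib
from collections import Counter

def md5_of(path):
    """Hex md5 digest of a file, equivalent to `md5sum path`."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def find_mismatches(digests):
    """Given {host: digest}, return hosts whose digest differs from the majority."""
    if not digests:
        return []
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(host for host, d in digests.items() if d != majority)

# Example: digests collected per node, e.g. via
#   ssh <host> md5sum /path/to/mapred-site.xml
# find_mismatches({"node1": "abc", "node2": "abc", "node3": "def"}) -> ["node3"]
```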
On Wed, Feb 16, 2011 at 10:15 AM, James Seigel <[email protected]> wrote:
> He might not have that conf distributed out to each machine
>
> Sent from my mobile. Please excuse the typos.
>
> On 2011-02-16, at 9:10 AM, Kelly Burkhart <[email protected]> wrote:
>
>> Our cluster admin (who's out of town today) has mapred.child.java.opts
>> set to -Xmx1280 in mapred-site.xml. However, if I go to the job
>> configuration page for a job I'm running right now, it claims this
>> option is set to -Xmx200m. There are other settings in mapred-site.xml
>> that are different too. Why would map/reduce jobs not respect the
>> mapred-site.xml file?
>>
>> -K
>>
>> On Wed, Feb 16, 2011 at 9:43 AM, Jim Falgout <[email protected]> wrote:
>>> You can set the amount of memory used by the reducer with the
>>> mapreduce.reduce.java.opts property. Set it in mapred-site.xml or
>>> override it in your job. You can set it to something like -Xmx512M to
>>> increase the amount of memory used by the JVM spawned for the reducer
>>> task.
>>>
>>> -----Original Message-----
>>> From: Kelly Burkhart [mailto:[email protected]]
>>> Sent: Wednesday, February 16, 2011 9:12 AM
>>> To: [email protected]
>>> Subject: Re: Reduce java.lang.OutOfMemoryError
>>>
>>> I have had it fail with a single reducer and with 100 reducers.
>>> Ultimately it needs to be funneled to a single reducer, though.
>>>
>>> -K
>>>
>>> On Wed, Feb 16, 2011 at 9:02 AM, real great..
>>> <[email protected]> wrote:
>>>> Hi,
>>>> How many reducers are you using currently?
>>>> Try increasing the number of reducers.
>>>> Let me know if it helps.
>>>>
>>>> On Wed, Feb 16, 2011 at 8:30 PM, Kelly Burkhart
>>>> <[email protected]> wrote:
>>>>
>>>>> Hello, I'm seeing frequent failures in reduce jobs with errors
>>>>> similar to this:
>>>>>
>>>>> 2011-02-15 15:21:10,163 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_201102081823_0175_m_002153_0, compressed len: 172492, decompressed len: 172488
>>>>> 2011-02-15 15:21:10,163 FATAL org.apache.hadoop.mapred.TaskRunner: attempt_201102081823_0175_r_000034_0 : Map output copy failure : java.lang.OutOfMemoryError: Java heap space
>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>>>>
>>>>> 2011-02-15 15:21:10,163 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 172488 bytes (172492 raw bytes) into RAM from attempt_201102081823_0175_m_002153_0
>>>>> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_201102081823_0175_m_002118_0, compressed len: 161944, decompressed len: 161940
>>>>> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_201102081823_0175_m_001704_0, compressed len: 228365, decompressed len: 228361
>>>>> 2011-02-15 15:21:10,424 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201102081823_0175_r_000034_0: Failed fetch #1 from attempt_201102081823_0175_m_002153_0
>>>>> 2011-02-15 15:21:10,424 FATAL org.apache.hadoop.mapred.TaskRunner: attempt_201102081823_0175_r_000034_0 : Map output copy failure : java.lang.OutOfMemoryError: Java heap space
>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>>>>
>>>>> Some also show this:
>>>>>
>>>>> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>         at sun.net.www.http.ChunkedInputStream.<init>(ChunkedInputStream.java:63)
>>>>>         at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:811)
>>>>>         at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
>>>>>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1072)
>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1447)
>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1349)
>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>>>>
>>>>> The particular job I'm running is an attempt to merge multiple time
>>>>> series files into a single file.
>>>>> The job tracker shows the following:
>>>>>
>>>>> Kind     Num Tasks  Complete  Killed  Failed/Killed Task Attempts
>>>>> map      15795      15795     0       0 / 29
>>>>> reduce   100        30        70      17 / 29
>>>>>
>>>>> All of the files I'm reading have records with a timestamp key
>>>>> similar to:
>>>>>
>>>>> 2011-01-03 08:30:00.457000<tab><record>
>>>>>
>>>>> My map job is a simple Python program that ignores rows with times
>>>>> < 08:30:00 or > 15:00:00, determines the type of each input row,
>>>>> and writes it to stdout with very minor modification. It maintains
>>>>> no state and should not use any significant memory. My reducer is
>>>>> the IdentityReducer. The input files are individually gzipped and
>>>>> then put into HDFS. The total uncompressed size of the output
>>>>> should be around 150G. Our cluster is 32 nodes, each of which has
>>>>> 16G of RAM, and most of which have two 2T drives. We're running
>>>>> Hadoop 0.20.2.
>>>>>
>>>>> Can anyone provide some insight into how we can eliminate this
>>>>> issue? I'm certain this email does not provide enough info; please
>>>>> let me know what further information is needed to troubleshoot.
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> -Kelly
>>>>>
>>>>
>>>> --
>>>> Regards,
>>>> R.V.
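For reference, a time-window filter like the mapper described above can be sketched as a Hadoop Streaming script. This is an illustrative sketch, not Kelly's actual code; it assumes tab-separated lines keyed by a `YYYY-MM-DD HH:MM:SS.ffffff` timestamp:

```python
import sys

# Keep only records timestamped within the window [08:30:00, 15:00:00].
WINDOW_START = "08:30:00"
WINDOW_END = "15:00:00"

def in_window(line):
    """True if the line's timestamp key falls inside the window.

    Lines look like: '2011-01-03 08:30:00.457000<tab><record>'.
    Plain string comparison works because HH:MM:SS sorts lexicographically.
    """
    try:
        key, _ = line.split("\t", 1)
        time_part = key.split(" ", 1)[1]
    except (IndexError, ValueError):
        return False  # malformed line: drop it
    return WINDOW_START <= time_part[:8] <= WINDOW_END

def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        if in_window(line):
            stdout.write(line)  # identity pass-through; the real job also tweaks the record

if __name__ == "__main__":
    main()
```

A mapper like this holds only one line at a time, which matches Kelly's point that the OOM is unlikely to be in the map side; the heap pressure is in the reduce-side shuffle.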

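One note on the -Xmx1280 value quoted above: JVM -Xmx settings without a unit suffix are interpreted in bytes, so -Xmx1280m is almost certainly what was intended, and a job configuration page reporting -Xmx200m suggests the tasks are picking up Hadoop's built-in default rather than the cluster's mapred-site.xml. A sketch of the relevant fragment for Hadoop 0.20.x (value and path are illustrative):

```xml
<!-- mapred-site.xml fragment (sketch): heap for spawned map/reduce task JVMs.
     Note the unit suffix: a bare -Xmx1280 would mean 1280 *bytes*. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1280m</value>
</property>
```

For this value to take effect, it must be present in the configuration seen by the node where the job is submitted, not only on the TaskTracker nodes.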