For the sake of http://xkcd.com/979/, and since this was cross-posted: Håvard managed to solve this specific issue via Joey's response at https://groups.google.com/a/cloudera.org/group/cdh-user/msg/c55760868efa32e2
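
For anyone who lands on this thread later: independent of the fix in that reply, the thread quoted below boils down to a streaming script that forks a full hadoop client JVM via os.system("hadoop dfs -put ...") and never checks the result. A minimal sketch of a more defensive way to do the same thing follows; the put_to_hdfs helper, the 256 MB cap, and the reuse of Håvard's paths are my own illustrative assumptions, not the actual fix from that reply.

import os
import subprocess

def put_to_hdfs(local_path, hdfs_path, heap_mb=256):
    # bin/hadoop uses HADOOP_HEAPSIZE (in MB) for the client JVM's -Xmx,
    # so each forked "hadoop dfs -put" stays small, assuming hadoop-env.sh
    # does not already force its own value.
    env = dict(os.environ)
    env["HADOOP_HEAPSIZE"] = str(heap_mb)
    # check_call raises CalledProcessError on a non-zero exit status,
    # unlike os.system, which just returns the status and lets a bare
    # "except: pass" hide the failure.
    subprocess.check_call(["hadoop", "dfs", "-put", local_path, hdfs_path],
                          env=env)

# e.g. from inside the mapper, instead of
# os.system("hadoop dfs -put /home/havard/bio_sci/file.json /tmp/bio_sci/file.json"):
# put_to_hdfs("/home/havard/bio_sci/file.json", "/tmp/bio_sci/file.json")

Whether that alone helps of course depends on which JVM actually threw the heap error (the streaming task or the forked client), which is exactly what Rohit asks further down.
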
2012/2/14 Håvard Wahl Kongsgård <[email protected]>:
> My environment heap size varies from 18GB to 2GB.
> In mapred-site.xml: mapred.child.java.opts = -Xmx512M
>
> System: Ubuntu 10.04 LTS, java-6-sun-1.6.0.26, latest Cloudera version of
> Hadoop.
>
> This is from the task log:
>
> Original exception was:
> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
>         at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:376)
>         at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
>         at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
>         at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>         at org.apache.hadoop.mapred.Child.main(Child.java:264)
> Caused by: java.lang.OutOfMemoryError: Java heap space
>         at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:212)
>         at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
>         at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:51)
>         at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:418)
>
> I don't have a recursive loop (a while loop or anything else) in my script.
>
> My dumbo code is below. multi_tree() is just a simple function; its only
> error handling is a bare try/except that passes on any exception.
>
> def mapper(key, value):
>     v = value.split(" ")[0]
>     yield multi_tree(v), 1
>
> if __name__ == "__main__":
>     import dumbo
>     dumbo.run(mapper)
>
> -Håvard
>
> On Mon, Feb 13, 2012 at 8:52 PM, Rohit <[email protected]> wrote:
>> Hi,
>>
>> What threw the heap error? Was it the Java VM, or the shell environment?
>>
>> It would also be good to look at the free RAM on your system before and
>> after you ran the script, to see whether the system is running low on
>> memory.
>>
>> Are you using a recursive loop in your script?
>>
>> Thanks,
>> Rohit
>>
>> Rohit Bakhshi
>> www.hortonworks.com (http://www.hortonworks.com/)
>>
>> On Monday, February 13, 2012 at 10:39 AM, Håvard Wahl Kongsgård wrote:
>>
>>> Hi, I originally posted this on the dumbo forum, but it's more of a
>>> general Hadoop scripting issue.
>>>
>>> When testing a simple script that creates some local files and then
>>> copies them to HDFS with
>>> os.system("hadoop dfs -put /home/havard/bio_sci/file.json /tmp/bio_sci/file.json")
>>> the tasks fail with an out-of-heap-memory error. The files are tiny,
>>> and I have tried increasing the heap size. When I skip the
>>> hadoop dfs -put, the tasks do not fail.
>>>
>>> Is it wrong to use hadoop dfs -put inside a script that is itself run
>>> by hadoop? Should I instead transfer the files at the end with a
>>> combiner, or simply mount HDFS locally and write to it directly? Any
>>> general suggestions?
>>>
>>> --
>>> Håvard Wahl Kongsgård
>>> NTNU
>>>
>>> http://havard.security-review.net/
>>
>
> --
> Håvard Wahl Kongsgård
> NTNU
>
> http://havard.security-review.net/

--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
