For the sake of http://xkcd.com/979/, and since this was cross-posted: Håvard managed to solve this specific issue via Joey's response at https://groups.google.com/a/cloudera.org/group/cdh-user/msg/c55760868efa32e2
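
For anyone who lands on this thread later: independent of the fix in that reply, the thread quoted below boils down to a streaming script that forks a full hadoop client JVM via os.system("hadoop dfs -put ...") and never checks the result. A minimal sketch of a more defensive way to do the same thing follows; the put_to_hdfs helper, the 256 MB cap, and the reuse of Håvard's paths are my own illustrative assumptions, not the actual fix from that reply.

import os
import subprocess

def put_to_hdfs(local_path, hdfs_path, heap_mb=256):
    # bin/hadoop uses HADOOP_HEAPSIZE (in MB) for the client JVM's -Xmx,
    # so each forked "hadoop dfs -put" stays small, assuming hadoop-env.sh
    # does not already force its own value.
    env = dict(os.environ)
    env["HADOOP_HEAPSIZE"] = str(heap_mb)
    # check_call raises CalledProcessError on a non-zero exit status,
    # unlike os.system, which just returns the status and lets a bare
    # "except: pass" hide the failure.
    subprocess.check_call(["hadoop", "dfs", "-put", local_path, hdfs_path],
                          env=env)

# e.g. from inside the mapper, instead of
# os.system("hadoop dfs -put /home/havard/bio_sci/file.json /tmp/bio_sci/file.json"):
# put_to_hdfs("/home/havard/bio_sci/file.json", "/tmp/bio_sci/file.json")

Whether that alone helps of course depends on which JVM actually threw the heap error (the streaming task or the forked client), which is exactly what Rohit asks further down.
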
2012/2/14 Håvard Wahl Kongsgård <[email protected]>:
> My environment heap size varies from 18GB to 2GB.
> In mapred-site.xml: mapred.child.java.opts = -Xmx512M
>
> System: Ubuntu 10.04 LTS, java-6-sun-1.6.0.26, latest Cloudera version of
> Hadoop.
>
> This is from the task log:
>
> Original exception was:
> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
>         at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:376)
>         at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
>         at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
>         at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>         at org.apache.hadoop.mapred.Child.main(Child.java:264)
> Caused by: java.lang.OutOfMemoryError: Java heap space
>         at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:212)
>         at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
>         at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:51)
>         at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:418)
>
> I don't have a recursive loop (a while loop or anything else) in my script.
>
> My dumbo code is below. multi_tree() is just a simple function; its only
> error handling is a bare try/except that passes on any exception.
>
> def mapper(key, value):
>     v = value.split(" ")[0]
>     yield multi_tree(v), 1
>
> if __name__ == "__main__":
>     import dumbo
>     dumbo.run(mapper)
>
> -Håvard
>
> On Mon, Feb 13, 2012 at 8:52 PM, Rohit <[email protected]> wrote:
>> Hi,
>>
>> What threw the heap error? Was it the Java VM, or the shell environment?
>>
>> It would also be good to look at the free RAM on your system before and
>> after you ran the script, to see whether the system is running low on
>> memory.
>>
>> Are you using a recursive loop in your script?
>>
>> Thanks,
>> Rohit
>>
>> Rohit Bakhshi
>> www.hortonworks.com (http://www.hortonworks.com/)
>>
>> On Monday, February 13, 2012 at 10:39 AM, Håvard Wahl Kongsgård wrote:
>>
>>> Hi, I originally posted this on the dumbo forum, but it's more of a
>>> general Hadoop scripting issue.
>>>
>>> When testing a simple script that creates some local files and then
>>> copies them to HDFS with
>>> os.system("hadoop dfs -put /home/havard/bio_sci/file.json /tmp/bio_sci/file.json")
>>> the tasks fail with an out-of-heap-memory error. The files are tiny,
>>> and I have tried increasing the heap size. When I skip the
>>> hadoop dfs -put, the tasks do not fail.
>>>
>>> Is it wrong to use hadoop dfs -put inside a script that is itself run
>>> by hadoop? Should I instead transfer the files at the end with a
>>> combiner, or simply mount HDFS locally and write to it directly? Any
>>> general suggestions?
>>>
>>> --
>>> Håvard Wahl Kongsgård
>>> NTNU
>>>
>>> http://havard.security-review.net/
>>
>
> --
> Håvard Wahl Kongsgård
> NTNU
>
> http://havard.security-review.net/

--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
