Sorry for cross-posting again. There is still something strange with
the DFS client and Python. With the very simple code below, I get no
errors, but also no output in /tmp/bio_sci/.
I could use FUSE, but this issue should be of general interest to
Hadoop/Python users. Can anyone replicate this?

import os

def multi_tree(value):
    # Create an empty file in HDFS for each value; stdout/stderr discarded
    os.system("hadoop dfs -touchz /tmp/bio_sci/" + str(value) + " > /dev/null 2> /dev/null")

def mapper(key, value):
    v = value.split(" ")[0]
    yield multi_tree(v), 1

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper)
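
For reference, here is roughly the same thing written with subprocess
instead of os.system, as a minimal sketch (it assumes the hadoop binary is
on the task's PATH). The only point is that the command's exit status and
stderr are no longer thrown away by the /dev/null redirection, so a
failing touchz would at least show up in the task logs:

import subprocess
import sys

def multi_tree(value):
    # Same touchz as above, but capture the exit status and stderr so a
    # failing "hadoop dfs -touchz" is visible instead of being silently
    # discarded.
    cmd = ["hadoop", "dfs", "-touchz", "/tmp/bio_sci/" + str(value)]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        sys.stderr.write("touchz failed for %s: %s\n" % (value, err))

def mapper(key, value):
    v = value.split(" ")[0]
    yield multi_tree(v), 1

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper)
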
-Håvard
On Tue, Feb 14, 2012 at 3:01 PM, Harsh J <[email protected]> wrote:
> For the sake of http://xkcd.com/979/, and since this was cross posted,
> Håvard managed to solve this specific issue via Joey's response at
> https://groups.google.com/a/cloudera.org/group/cdh-user/msg/c55760868efa32e2
>
> 2012/2/14 Håvard Wahl Kongsgård <[email protected]>:
>> My environment heap size varies from 18GB to 2GB;
>> in mapred-site.xml, mapred.child.java.opts is set to -Xmx512M.
>>
>> System: Ubuntu 10.04 LTS, java-6-sun-1.6.0.26, latest Cloudera version of
>> Hadoop.
>>
>>
>> This is the log from the tasklog:
>> Original exception was:
>> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
>>     at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:376)
>>     at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
>>     at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
>>     at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>>     at org.apache.hadoop.mapred.Child.main(Child.java:264)
>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>     at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:212)
>>     at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
>>     at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:51)
>>     at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:418)
>>
>>
>> I don't have a recursive loop, a while loop, or anything like that.
>>
>> My dumbo code:
>>
>> multi_tree() is just a simple function where the error handling is
>>
>>     try:
>>         ...
>>     except:
>>         pass
>>
>> def mapper(key, value):
>>     v = value.split(" ")[0]
>>     yield multi_tree(v), 1
>>
>>
>> if __name__ == "__main__":
>>     import dumbo
>>     dumbo.run(mapper)
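>>
>> As an aside, a minimal sketch of what I mean, with the exception written
>> to stderr (which ends up in the task logs) instead of being swallowed by
>> a bare "pass"; the body of multi_tree() is only a placeholder here:
>>
>> import sys
>> import traceback
>>
>> def multi_tree(value):
>>     try:
>>         pass  # placeholder for the real per-value work
>>     except Exception:
>>         # log the failure to stderr instead of silently ignoring it
>>         traceback.print_exc(file=sys.stderr)
>>
>> def mapper(key, value):
>>     v = value.split(" ")[0]
>>     yield multi_tree(v), 1
>>
>> if __name__ == "__main__":
>>     import dumbo
>>     dumbo.run(mapper)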
>>
>>
>> -Håvard
>>
>>
>> On Mon, Feb 13, 2012 at 8:52 PM, Rohit <[email protected]> wrote:
>>> Hi,
>>>
>>> What threw the heap error? Was it the Java VM, or the shell environment?
>>>
>>> It would be good to look at free RAM memory on your system before and after
>>> you ran the script as well, to see if your system is running low on memory.
>>>
>>> Are you using a recursive loop in your script?
>>>
>>> Thanks,
>>> Rohit
>>>
>>>
>>> Rohit Bakhshi
>>>
>>> www.hortonworks.com
>>>
>>> On Monday, February 13, 2012 at 10:39 AM, Håvard Wahl Kongsgård wrote:
>>>
>>>> Hi, I originally posted this on the dumbo forum, but it's really a
>>>> more general Hadoop scripting issue.
>>>>
>>>> When testing a simple script that created some local files
>>>> and then copied them to HDFS
>>>> with os.system("hadoop dfs -put /home/havard/bio_sci/file.json
>>>> /tmp/bio_sci/file.json")
>>>>
>>>> the tasks fail with an out-of-heap-memory error. The files are tiny,
>>>> and I have tried increasing the
>>>> heap size. When I skip the hadoop dfs -put, the tasks do not fail.
>>>>
>>>> Is it wrong to use hadoop dfs -put inside a script that is itself run
>>>> with Hadoop? Should I instead
>>>> transfer the files at the end with a combiner, or simply mount HDFS
>>>> locally and write to it directly? Any general suggestions?
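>>>>
>>>> If HDFS were mounted locally via FUSE (at a hypothetical mount point
>>>> such as /mnt/hdfs, which is an assumption, not my actual setup), the
>>>> copy would just be an ordinary file operation, roughly:
>>>>
>>>> import shutil
>>>>
>>>> # /mnt/hdfs is an assumed FUSE mount point for HDFS
>>>> shutil.copy("/home/havard/bio_sci/file.json",
>>>>             "/mnt/hdfs/tmp/bio_sci/file.json")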
>>>>
>>>>
>>>> --
>>>> Håvard Wahl Kongsgård
>>>> NTNU
>>>>
>>>> http://havard.security-review.net/
>>>
>>
>>
>>
>> --
>> Håvard Wahl Kongsgård
>> NTNU
>>
>> http://havard.security-review.net/
>
>
>
> --
> Harsh J
> Customer Ops. Engineer
> Cloudera | http://tiny.cloudera.com/about
--
Håvard Wahl Kongsgård
NTNU
http://havard.security-review.net/