[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3333:
-----------------------------------------------

    Attachment: MAPREDUCE-3333-20111108.txt

Finally tracked this down, with lots of help from Karam.

After MAPREDUCE-3256, we create one connection per container to a NodeManager, 
and this per-container connection wasn't being closed after use. Hadoop RPC 
creates threads for each connection, so the thread count soon reaches the 
ulimit on the node's number of processes, and Java beautifully describes that 
as an out-of-memory error.
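
To make the failure mode concrete, here is a minimal sketch of a per-container 
connection that owns a thread and is never closed. SketchConnection and 
ConnectionThreadLeakSketch are hypothetical stand-ins written for this note, 
not Hadoop's actual RPC classes; the only assumption carried over is "one 
thread per open connection".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for an RPC connection; NOT Hadoop's Client.Connection.
class SketchConnection implements AutoCloseable {
    private final CountDownLatch closed = new CountDownLatch(1);
    private final Thread reader;

    SketchConnection(String name) {
        // One dedicated thread per connection, parked until close() is called.
        reader = new Thread(() -> {
            try {
                closed.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        reader.setDaemon(true);
        reader.start();
    }

    boolean isThreadAlive() {
        return reader.isAlive();
    }

    @Override
    public void close() {
        closed.countDown();
        try {
            reader.join(5000); // wait for the connection thread to exit
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

class ConnectionThreadLeakSketch {
    // Opens one connection per "container" and closes it only if asked to.
    // Returns how many connection threads are still alive after all launches.
    static int launchContainers(int containers, boolean closeAfterUse) {
        List<SketchConnection> opened = new ArrayList<>();
        for (int i = 0; i < containers; i++) {
            SketchConnection c = new SketchConnection("conn-" + i);
            opened.add(c);
            // ... the container launch call would happen here ...
            if (closeAfterUse) {
                c.close();
            }
        }
        int alive = 0;
        for (SketchConnection c : opened) {
            if (c.isThreadAlive()) {
                alive++;
            }
        }
        // Release whatever was leaked so the sketch itself does not leak.
        for (SketchConnection c : opened) {
            c.close();
        }
        return alive;
    }

    public static void main(String[] args) {
        System.out.println("alive without close: " + launchContainers(50, false));
        System.out.println("alive with close:    " + launchContainers(50, true));
    }
}
```

Without the close, the number of live threads grows with the number of 
containers launched, which is how a large job can run into the per-user 
process limit.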

I put in an "RPC.stopProxy(obj)" call a couple of days ago, but that didn't 
work because of the multiple layers of RPC in YARN. It's time somebody cleaned 
up that mess.
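
As an illustration only: one way a stop call on the outermost object can fail 
to reach the real connection is a wrapper layer that does not forward the 
close. All class names below are hypothetical and this is not the actual YARN 
or Hadoop RPC code; it sketches the general hazard of layered proxies, not the 
specific reason the earlier stopProxy call had no effect.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical protocol; not a real YARN interface.
interface SketchProtocol {
    String startContainer(String containerId);
}

// Innermost layer: owns the connection resource.
class InnerRpcProxy implements SketchProtocol, Closeable {
    private boolean open = true;

    @Override
    public String startContainer(String containerId) {
        if (!open) {
            throw new IllegalStateException("connection closed");
        }
        return "started " + containerId;
    }

    @Override
    public void close() {
        open = false;
    }

    boolean isOpen() {
        return open;
    }
}

// Outer layer that delegates calls but offers no way to close the inner proxy.
class NonForwardingWrapper implements SketchProtocol {
    private final InnerRpcProxy inner;

    NonForwardingWrapper(InnerRpcProxy inner) {
        this.inner = inner;
    }

    @Override
    public String startContainer(String containerId) {
        return inner.startContainer(containerId);
    }
}

// Outer layer that forwards close() down to the layer it wraps.
class ForwardingWrapper implements SketchProtocol, Closeable {
    private final InnerRpcProxy inner;

    ForwardingWrapper(InnerRpcProxy inner) {
        this.inner = inner;
    }

    @Override
    public String startContainer(String containerId) {
        return inner.startContainer(containerId);
    }

    @Override
    public void close() {
        inner.close();
    }
}

class LayeredProxySketch {
    // Generic stop helper: it can only close what the outermost object exposes.
    // Returns true if a close was actually issued.
    static boolean stop(Object proxy) {
        if (proxy instanceof Closeable) {
            try {
                ((Closeable) proxy).close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return true;
        }
        return false;
    }
}
```

In this sketch, stopping a NonForwardingWrapper leaves the inner proxy open, 
while stopping a ForwardingWrapper closes it. Every layer has to pass the 
close along for the outermost call to have any effect.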

The attached patch should (finally) fix this. I cannot add any automated 
tests; I am testing only on a big cluster, where this reproduces consistently.

                
> MR AM for sort-job going out of memory
> --------------------------------------
>
>                 Key: MAPREDUCE-3333
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3333
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>         Attachments: MAPREDUCE-3333-20111102.txt, MAPREDUCE-3333-20111108.txt
>
>
> [~Karams] just found this. The usual sort job on a 350-node cluster hung due 
> to an OutOfMemoryError and eventually failed after an hour instead of 
> finishing in the usual 20-odd minutes.
> {code}
> 2011-11-02 11:40:36,438 ERROR [ContainerLauncher #258] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container launch failed for container_1320233407485_0002_01_001434 : java.lang.reflect.UndeclaredThrowableException
>         at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagerPBClientImpl.startContainer(ContainerManagerPBClientImpl.java:88)
>         at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:290)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> Caused by: com.google.protobuf.ServiceException: java.io.IOException: Failed on local exception: java.io.IOException: Couldn't set up IO streams; Host Details : local host is: "gsbl91281.blue.ygrid.yahoo.com/98.137.101.189"; destination host is: ""gsbl91525.blue.ygrid.yahoo.com":45450;
>         at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:139)
>         at $Proxy20.startContainer(Unknown Source)
>         at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagerPBClientImpl.startContainer(ContainerManagerPBClientImpl.java:81)
>         ... 4 more
> Caused by: java.io.IOException: Failed on local exception: java.io.IOException: Couldn't set up IO streams; Host Details : local host is: "gsbl91281.blue.ygrid.yahoo.com/98.137.101.189"; destination host is: ""gsbl91525.blue.ygrid.yahoo.com":45450;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:655)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1089)
>         at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:136)
>         ... 6 more
> Caused by: java.io.IOException: Couldn't set up IO streams
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:621)
>         at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:205)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1195)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1065)
>         ... 7 more
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:597)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:614)
>         ... 10 more
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
