GitHub user SaintBacchus opened a pull request:

    https://github.com/apache/spark/pull/5663

    [SPARK-6924][YARN] Fix driver hangs in yarn-client mode when network is disconnected

    When the driver's network is disconnected for a while in yarn-client mode,
    an IOException occurs in the 'Yarn application state monitor' thread and
    causes the driver to hang forever.
    
    To reproduce this scenario, do the following:
    * Run a Spark job in yarn-client mode
    * Run `ifconfig {your NIC} down`
    * After a while, run `ifconfig {same NIC} up`
    * The `SparkSubmit` JVM process will hang forever
    
    The exception log looks like this:
    ```
    INFO RetryInvocationHandler: Exception while invoking renewLease of class ClientNamenodeProtocolTranslatorPB over linux-223/9.91.8.223:65110 after 12 fail over attempts. Trying to fail over immediately.
    java.io.IOException: Failed on local exception: java.net.SocketException: Network is unreachable;
            at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
            at org.apache.hadoop.ipc.Client.call(Client.java:1472)
            at org.apache.hadoop.ipc.Client.call(Client.java:1399)
            at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
            at com.sun.proxy.$Proxy15.renewLease(Unknown Source)
            at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:571)
            at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:606)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
            at com.sun.proxy.$Proxy16.renewLease(Unknown Source)
            at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:879)
            at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
            at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
            at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
            at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: java.net.SocketException: Network is unreachable
            at sun.nio.ch.Net.connect0(Native Method)
            at sun.nio.ch.Net.connect(Net.java:465)
            at sun.nio.ch.Net.connect(Net.java:457)
            at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670)
            at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
            at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
            at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
            at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
            at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
            at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
            at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
            at org.apache.hadoop.ipc.Client.call(Client.java:1438)
    ```
    The Jira issue https://issues.apache.org/jira/browse/HDFS-3032 may be
    related to this problem.
    My solution is to catch the IOException raised while the `renewLease`
    logic runs and shut down the Spark driver.
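
The catch-and-shut-down idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: `MonitorSketch`, `RemoteCall`, `pollUntilFailure`, and the `shutdown` callback are hypothetical names, and the loop runs synchronously with a bounded number of polls so that the behaviour is easy to follow.

```java
import java.io.IOException;

// Hypothetical sketch; none of these names come from Spark or Hadoop.
final class MonitorSketch {

    /** Stand-in for any call that talks to the cluster and may fail with an IOException. */
    interface RemoteCall {
        void run() throws IOException;
    }

    /**
     * Polls up to maxPolls times. If a poll throws an IOException, the
     * shutdown callback runs once and the loop ends, so the caller is
     * stopped explicitly instead of being left hanging.
     *
     * @return true if an IOException triggered the shutdown callback
     */
    static boolean pollUntilFailure(RemoteCall poll, Runnable shutdown, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            try {
                poll.run();
            } catch (IOException e) {
                // Assumption: the real shutdown action would stop the driver.
                shutdown.run();
                return true;
            }
        }
        return false;
    }
}
```

In a real driver the loop would presumably run on the monitor thread without a poll limit, and the `shutdown` callback would stand in for whatever stops the driver; both are assumptions here.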

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/SaintBacchus/spark YarnClientNetWorkUnreachable

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5663.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5663
    
----
commit 5a283199952b70c4d007e9da60d80fd96fb9c2a6
Author: huangzhaowei <[email protected]>
Date:   2015-04-23T12:12:08Z

    When the driver's network is unreachable, an IOException will occur in thread
    'Yarn application state monitor' within yarn-client mode and cause the driver
    to hang forever.

----

