GitHub user SaintBacchus opened a pull request:
https://github.com/apache/spark/pull/5663
[SPARK-6924][YARN] Fix driver hangs in yarn-client mode when network is
disconnected
When the driver's network is disconnected for a while in yarn-client mode,
an IOException occurs in the 'Yarn application state monitor' thread and causes
the driver to hang forever.
To reproduce this scenario:
* Run a Spark job in yarn-client mode.
* Type `ifconfig {your NIC} down`.
* After a while, type `ifconfig {same NIC} up`.
* The `SparkSubmit` JVM process hangs forever.
The exception log looks like this:
```
INFO RetryInvocationHandler: Exception while invoking renewLease of class ClientNamenodeProtocolTranslatorPB over linux-223/9.91.8.223:65110 after 12 fail over attempts. Trying to fail over immediately.
java.io.IOException: Failed on local exception: java.net.SocketException: Network is unreachable;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
    at org.apache.hadoop.ipc.Client.call(Client.java:1472)
    at org.apache.hadoop.ipc.Client.call(Client.java:1399)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy15.renewLease(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:571)
    at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy16.renewLease(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:879)
    at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
    at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
    at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
    at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Network is unreachable
    at sun.nio.ch.Net.connect0(Native Method)
    at sun.nio.ch.Net.connect(Net.java:465)
    at sun.nio.ch.Net.connect(Net.java:457)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
    at org.apache.hadoop.ipc.Client.call(Client.java:1438)
```
HDFS-3032 (https://issues.apache.org/jira/browse/HDFS-3032) may be a related
issue.
My solution is to catch the IOException raised while the `renewLease` logic runs
and shut down the Spark driver.
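The idea of stopping the driver on an IOException can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in this pull request: `MonitorSketch`, `StatusPoller`, `runMonitor`, `startMonitor` and the `stopDriver` callback are all hypothetical names, and no Spark or Hadoop API is used. The point it shows is that the monitor loop treats an IOException as a reason to stop the driver instead of letting the thread end without anyone being notified.

```java
import java.io.IOException;

/** Hypothetical sketch; none of these names come from Spark or Hadoop. */
class MonitorSketch {

    /** Stand-in for a call that polls the cluster and may fail on network loss. */
    interface StatusPoller {
        boolean isFinished() throws IOException;
    }

    /**
     * Polls until the application finishes, then stops the driver.
     * An IOException (for example "Network is unreachable") also stops the
     * driver, so nothing is left waiting on a monitor that has given up.
     */
    static void runMonitor(StatusPoller poller, Runnable stopDriver, long intervalMs) {
        try {
            while (!poller.isFinished()) {
                Thread.sleep(intervalMs);
            }
        } catch (IOException e) {
            System.err.println("Monitor lost contact with the cluster: " + e.getMessage());
        } catch (InterruptedException e) {
            // Interrupted by the owner, which is assumed to handle its own shutdown.
            Thread.currentThread().interrupt();
            return;
        }
        stopDriver.run();
    }

    /** Runs the monitor loop on a daemon thread and returns that thread. */
    static Thread startMonitor(StatusPoller poller, Runnable stopDriver, long intervalMs) {
        Thread t = new Thread(() -> runMonitor(poller, stopDriver, intervalMs),
                "application-state-monitor-sketch");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

In a real driver the `stopDriver` callback would be whatever releases the waiting main thread; here it is left as a plain `Runnable` so that the control flow stays visible.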
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/SaintBacchus/spark YarnClientNetWorkUnreachable
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5663.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5663
----
commit 5a283199952b70c4d007e9da60d80fd96fb9c2a6
Author: huangzhaowei <[email protected]>
Date: 2015-04-23T12:12:08Z
When driver's netword is unreachable, an IOException will occur in thread
'Yarn application state monitor' within yarn-client mode and cause the driver
hang forever.
----