Lars Hofhansl created HDFS-4455:
-----------------------------------
Summary: Datanode sometimes gives up permanently on Namenode in HA setup
Key: HDFS-4455
URL: https://issues.apache.org/jira/browse/HDFS-4455
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Reporter: Lars Hofhansl
Today we got ourselves into a situation where we hard-killed the cluster (kill
-9 across the board on all processes), and upon restarting, all DNs would
permanently give up on one of the NNs in our two-NN HA setup (using QJM).
The HA setup is correct (prior to this we had failed over the NNs many times
for testing). Bouncing the DNs resolved the problem.
In the logs I see this exception:
{code}
2013-01-29 23:32:49,461 FATAL datanode.DataNode - Initialization failed for block pool Block pool BP-1852726028-<ip>-1358813649047 (storage id DS-60505003-<ip>-50010-1353106051747) service to <host>/<ip>:8020
java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "<host>/<ip>"; destination host is: "<host>":8020;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759)
        at org.apache.hadoop.ipc.Client.call(Client.java:1164)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        at $Proxy10.registerDatanode(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
        at $Proxy10.registerDatanode(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:149)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:619)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:221)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:661)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Response is null.
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:885)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:813)
2013-01-29 23:32:49,463 WARN datanode.DataNode - Ending block pool service for: Block pool BP-1852726028-<ip>-1358813649047 (storage id DS-60505003-<ip>-50010-1353106051747) service to <host>/<ip>:8020
{code}
So somehow in BPServiceActor.connectToNNAndHandshake() we made it all the way
to register(), and then failed in bpNamenode.registerDatanode(bpRegistration)
with an IOException, which is not caught and causes the block pool service to
fail as a whole.
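For reference, the retry loop in BPServiceActor.register() looks roughly like the sketch below (paraphrased from memory, not the exact source, reusing the actor's existing fields): only SocketTimeoutException is treated as retriable, so any other IOException thrown by registerDatanode() propagates out of connectToNNAndHandshake() and ends the actor.
{code}
// Paraphrased sketch of BPServiceActor.register() -- not the exact source.
void register() throws IOException {
  while (shouldRun()) {
    try {
      // Register with the NN; on success the NN hands back an updated registration.
      bpRegistration = bpNamenode.registerDatanode(bpRegistration);
      break;
    } catch (SocketTimeoutException e) {
      // Only a timeout is retried. Any other IOException (such as the
      // "Response is null." above) escapes, and BPServiceActor.run() then
      // ends the block pool service for this NN permanently.
      LOG.info("Problem connecting to server: " + nnAddr);
      sleepAndLogInterrupts(1000, "connecting to server");
    }
  }
}
{code}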
No doubt the IOException itself was caused by one of the NNs being in a weird
state. While that happened, the active NN claimed that the FS was corrupted and
stayed in safe mode, and the DNs only registered with the standby NN. Failing
over to the 2nd NN and then restarting the first NN and failing back did not
change that.
No amount of bouncing/failing over the HA NNs would make the DNs reconnect to
one of the NNs.
In BPServiceActor.register(), should we catch IOException instead of
SocketTimeoutException? That way it would continue to retry and eventually
connect to the NN.
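Something along these lines, against the sketch above (just a sketch of the idea, not a tested patch):
{code}
    } catch (IOException e) {
      // Treat any IOException during registration as retriable, so an NN that
      // is temporarily in a bad state does not make the DN give up on it
      // permanently; the actor keeps retrying until registerDatanode() succeeds.
      LOG.info("Problem connecting to server: " + nnAddr, e);
      sleepAndLogInterrupts(1000, "connecting to server");
    }
{code}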