[ https://issues.apache.org/jira/browse/HDFS-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Juan Yu reassigned HDFS-4455:
-----------------------------
Assignee: Juan Yu
> Datanode sometimes gives up permanently on Namenode in HA setup
> ---------------------------------------------------------------
>
> Key: HDFS-4455
> URL: https://issues.apache.org/jira/browse/HDFS-4455
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, ha
> Affects Versions: 2.0.2-alpha
> Reporter: Lars Hofhansl
> Assignee: Juan Yu
> Priority: Critical
>
> Today we got ourselves into a situation where we hard-killed the cluster
> (kill -9 across the board on all processes), and upon restarting, all DNs would
> permanently give up on one of the NNs in our two-NN HA setup (using QJM).
> The HA setup is correct (prior to this we had failed over the NNs many times for
> testing). Bouncing the DNs resolved the problem.
> In the logs I see this exception:
> {code}
> 2013-01-29 23:32:49,461 FATAL datanode.DataNode - Initialization failed for block pool Block pool BP-1852726028-<ip>-1358813649047 (storage id DS-60505003-<ip>-50010-1353106051747) service to <host>/<ip>:8020
> java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "<host>/<ip>"; destination host is: "<host>":8020;
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1164)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>     at $Proxy10.registerDatanode(Unknown Source)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>     at $Proxy10.registerDatanode(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:149)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:619)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:221)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:661)
>     at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException: Response is null.
>     at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:885)
>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:813)
> 2013-01-29 23:32:49,463 WARN datanode.DataNode - Ending block pool service for: Block pool BP-1852726028-<ip>-1358813649047 (storage id DS-60505003-<ip>-50010-1353106051747) service to <host>/<ip>:8020
> {code}
> So somehow in BPServiceActor.connectToNNAndHandshake() we made it all the way
> to register(), then failed in bpNamenode.registerDatanode(bpRegistration)
> with an IOException, which is not caught and causes the block pool service to
> fail as a whole.
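> For reference, the registration retry loop in BPServiceActor.register() looks
> roughly like the sketch below (paraphrased from the 2.0.x sources, not a
> verbatim excerpt): only SocketTimeoutException is retried, so any other
> IOException from registerDatanode() propagates up through
> connectToNNAndHandshake() and ends the actor thread.
> {code}
> // Paraphrased sketch of BPServiceActor.register() in 2.0.x -- not a verbatim
> // excerpt, details may differ.
> void register() throws IOException {
>   while (shouldRun()) {
>     try {
>       // The call that threw "Response is null." in the log above.
>       bpRegistration = bpNamenode.registerDatanode(bpRegistration);
>       break;
>     } catch (SocketTimeoutException e) {
>       // Only timeouts are retried; any other IOException (like the one above)
>       // escapes through connectToNNAndHandshake() and ends the block pool
>       // service.
>       LOG.info("Problem connecting to server: " + nnAddr);
>       sleepAndLogInterrupts(1000, "connecting to server");
>     }
>   }
>   // ... registration bookkeeping continues here ...
> }
> {code}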
> No doubt that was caused by one of the NNs being in a weird state. While that
> happened, the active NN claimed that the FS was corrupted and stayed in safe
> mode, and the DNs only registered with the standby NN. Failing over to the 2nd
> NN and then restarting the first NN and failing back did not change that.
> No amount of bouncing/failing over the HA NNs would get the DNs to reconnect to
> one of the NNs.
> In BPServiceActor.register(), should we catch IOException instead of
> SocketTimeoutException? That way it would continue to retry and eventually
> connect to the NN.
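> In code form, the suggestion would look roughly like this (a hypothetical,
> untested sketch that keeps the existing retry/sleep behaviour):
> {code}
> // Hypothetical change in BPServiceActor.register(): broaden the catch so a
> // transient failure on the NN side does not permanently end the block pool
> // service.
> while (shouldRun()) {
>   try {
>     bpRegistration = bpNamenode.registerDatanode(bpRegistration);
>     break;
>   } catch (IOException e) {  // was: catch (SocketTimeoutException e)
>     LOG.info("Problem connecting to server: " + nnAddr, e);
>     sleepAndLogInterrupts(1000, "connecting to server");
>   }
> }
> {code}
> One possible concern with retrying on every IOException is that it could mask a
> genuinely fatal misconfiguration, so it may be worth limiting the broader catch
> to failures that are plausibly transient.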
--
This message was sent by Atlassian JIRA
(v6.2#6252)