Jagane Sundar created HDFS-4858:
-----------------------------------
Summary: HDFS DataNode to NameNode RPC should timeout
Key: HDFS-4858
URL: https://issues.apache.org/jira/browse/HDFS-4858
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 2.0.4-alpha, 3.0.0, 2.0.5-beta, 2.0.4.1-alpha
Environment: Redhat/CentOS 6.4 64 bit Linux
Reporter: Jagane Sundar
Priority: Minor
Fix For: 3.0.0, 2.0.5-beta
The DataNode is configured with ipc.client.ping set to false and
ipc.ping.interval set to 14000. With this configuration, the IPC Client (the
DataNode, in this case) should time out after 14000 milliseconds (14 seconds)
if the Standby NameNode does not respond to a sendHeartbeat.
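A minimal sketch of how these two settings are expected to combine, assuming
that disabling the ping turns the ping interval into the RPC timeout and that
the interval is expressed in milliseconds. The class, the method name, and the
60000 ms default are hypothetical, and java.util.Properties stands in for
Hadoop's Configuration so the example runs without Hadoop on the classpath.

```java
import java.util.Properties;

/** Sketch only: models the expected timeout semantics, not Hadoop's own code. */
final class IpcTimeoutSketch {
    // Key names come from the report; the default interval is an assumption.
    static final String PING_KEY = "ipc.client.ping";
    static final String PING_INTERVAL_KEY = "ipc.ping.interval";
    static final int ASSUMED_DEFAULT_INTERVAL_MS = 60_000;

    /** Returns the RPC timeout in ms, or -1 meaning "no timeout, keep pinging". */
    static int effectiveTimeoutMs(Properties conf) {
        boolean pingEnabled =
            Boolean.parseBoolean(conf.getProperty(PING_KEY, "true"));
        int intervalMs = Integer.parseInt(conf.getProperty(
            PING_INTERVAL_KEY, String.valueOf(ASSUMED_DEFAULT_INTERVAL_MS)));
        // With ping disabled, the interval acts as a hard deadline for the call.
        return pingEnabled ? -1 : intervalMs;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(PING_KEY, "false");
        conf.setProperty(PING_INTERVAL_KEY, "14000");
        System.out.println("timeout ms: " + effectiveTimeoutMs(conf));
    }
}
```

Under these assumptions, the configuration in this report should give each
sendHeartbeat a 14-second deadline.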
What we observe is this: if the Standby NameNode reboots for any reason, the
DataNodes that are heartbeating to it get stuck forever in sendHeartbeat. See
the stack trace below. When the Standby NameNode comes back up, the DataNode
never re-registers with it. After that, failover to the Standby fails
completely.
The desired behavior is that the DataNode's sendHeartbeat should time out after
14 seconds and keep retrying until the Standby NameNode comes back up. When it
does, the DataNode should reconnect, re-register, and offer service.
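The desired behavior can be sketched with plain java.util.concurrent, under the
assumption that each heartbeat attempt gets its own deadline and a failed or
timed-out attempt is simply retried. Heartbeat, sendWithTimeout, and the
attempt limit are hypothetical names for illustration; this is not the
DataNode's actual retry logic, which would also back off between attempts and
re-register after reconnecting.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class HeartbeatRetrySketch {
    /** Hypothetical stand-in for the remote sendHeartbeat call. */
    interface Heartbeat { void send() throws Exception; }

    /**
     * Runs hb.send() with a per-attempt deadline and retries on timeout or
     * failure. Returns the attempt number that succeeded, or -1 if none did.
     */
    static int sendWithTimeout(Heartbeat hb, long timeoutMs, int maxAttempts) {
        // Daemon threads so a stuck call can never keep the JVM alive.
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "heartbeat-sketch");
            t.setDaemon(true);
            return t;
        });
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<Void> pending = pool.submit(() -> { hb.send(); return null; });
                try {
                    pending.get(timeoutMs, TimeUnit.MILLISECONDS);
                    return attempt; // the peer answered within the deadline
                } catch (TimeoutException | ExecutionException e) {
                    pending.cancel(true); // interrupt the stuck call, then retry
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
            return -1;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The point of the sketch is that the caller regains control after every
deadline, so it can notice when the peer comes back instead of blocking on the
first unanswered call.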
Specifically, in the class DatanodeProtocolClientSideTranslatorPB.java, the
method createNamenode should use RPC.getProtocolProxy and not RPC.getProxy to
create the DatanodeProtocolPB object.
Stack trace of thread stuck in the DataNode after the Standby NN has rebooted:
Thread 25 (DataNode: [file:///opt/hadoop/data] heartbeating to
vmhost6-vm1/10.10.10.151:8020):
State: WAITING
Blocked count: 23843
Waited count: 45676
Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.ipc.Client.call(Client.java:1220)
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445)
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
java.lang.Thread.run(Thread.java:662)
DataNode RPC to Standby NameNode never times out.
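The trace above shows the thread parked in Object.wait with no deadline. As a
general illustration only (this is not how Hadoop's ipc.Client enforces its
timeout, and Call, complete, and awaitDone are hypothetical names), a wait that
is bounded by a deadline looks like this:

```java
/** Sketch only: a monitor wait that gives up after a deadline. */
final class BoundedWaitSketch {
    static final class Call {
        private boolean done;

        /** Marks the call as answered and wakes any waiter. */
        synchronized void complete() {
            done = true;
            notifyAll();
        }

        /** Returns true if the call completed within timeoutMs, else false. */
        synchronized boolean awaitDone(long timeoutMs) {
            long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
            while (!done) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                // wait(0) would block forever, so stop before passing 0.
                if (remainingMs <= 0) {
                    return false;
                }
                try {
                    wait(remainingMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        }
    }
}
```

A caller that gets false back can close the connection and retry, rather than
staying in the WAITING state indefinitely.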