[ https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892888#comment-13892888 ]
Konstantin Boudnik commented on HDFS-4858:
------------------------------------------
+1 on the patch, pending test results.
> HDFS DataNode to NameNode RPC should timeout
> --------------------------------------------
>
> Key: HDFS-4858
> URL: https://issues.apache.org/jira/browse/HDFS-4858
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 3.0.0, 2.1.0-beta, 2.0.4-alpha, 2.0.5-alpha
> Environment: Redhat/CentOS 6.4 64 bit Linux
> Reporter: Jagane Sundar
> Assignee: Konstantin Boudnik
> Priority: Minor
> Fix For: 3.0.0, 2.3.0
>
> Attachments: HDFS-4858.patch, HDFS-4858.patch
>
>
> The DataNode is configured with ipc.client.ping false and ipc.ping.interval
> 14000. This configuration means that the IPC Client (DataNode, in this case)
> should time out after 14000 milliseconds (14 seconds) if the Standby NameNode
> does not respond to a sendHeartbeat.
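
To make the intended interpretation of these two settings concrete, here is a
minimal, self-contained Java sketch. It does not use Hadoop classes: the class
name RpcTimeoutSketch, the method rpcTimeoutMs, and the 60000 ms default are
assumptions made for illustration. It only encodes the rule described above
(ping disabled means the ping interval acts as the RPC timeout; ping enabled
means no timeout).

```java
import java.util.Properties;

public class RpcTimeoutSketch {
    // Assumed default for ipc.ping.interval, in milliseconds (illustrative).
    static final int DEFAULT_PING_INTERVAL_MS = 60_000;

    /** Returns the RPC timeout in milliseconds, or -1 for "wait indefinitely". */
    static int rpcTimeoutMs(Properties conf) {
        boolean pingEnabled =
            Boolean.parseBoolean(conf.getProperty("ipc.client.ping", "true"));
        if (pingEnabled) {
            // With pings enabled the client keeps waiting, so no timeout applies.
            return -1;
        }
        // With pings disabled the ping interval doubles as the call timeout.
        return Integer.parseInt(conf.getProperty(
            "ipc.ping.interval", String.valueOf(DEFAULT_PING_INTERVAL_MS)));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("ipc.client.ping", "false");
        conf.setProperty("ipc.ping.interval", "14000");
        System.out.println(rpcTimeoutMs(conf)); // prints 14000
    }
}
```

Under this reading, the configuration in the report should yield a 14000 ms
timeout on each heartbeat call.
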
> What we observe is this: if the Standby NameNode happens to reboot for any
> reason, the DataNodes that are heartbeating to this Standby get stuck forever
> while trying to sendHeartbeat (see the stack trace below). When the
> Standby NameNode comes back up, the DataNode never re-registers
> with it, and failover then fails completely.
> The desired behavior is that the DataNode's sendHeartbeat should time out after
> 14 seconds and keep retrying until the Standby NameNode comes back up. When
> it does, the DataNode should reconnect, re-register, and offer service.
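
The desired behavior (a bounded wait per heartbeat, followed by a retry) can be
sketched in plain Java as below. This is not the DataNode's code and not the
attached patch: HeartbeatRetrySketch, sendUntilAcked, and the maxAttempts cap
are hypothetical names and parameters introduced so that the example terminates
on its own. A real actor would presumably keep retrying without a cap.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class HeartbeatRetrySketch {
    /**
     * Invokes the heartbeat until one attempt completes within timeoutMs.
     * Returns the number of attempts used, or -1 if maxAttempts was exhausted.
     */
    static int sendUntilAcked(Callable<Void> heartbeat, long timeoutMs,
                              long retryDelayMs, int maxAttempts)
            throws InterruptedException {
        // Daemon threads so a hung attempt never keeps the JVM alive.
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<Void> call = pool.submit(heartbeat);
                try {
                    // Bounded wait: give up on this attempt after timeoutMs.
                    call.get(timeoutMs, TimeUnit.MILLISECONDS);
                    return attempt;
                } catch (TimeoutException | ExecutionException e) {
                    call.cancel(true);          // interrupt the stuck attempt
                    Thread.sleep(retryDelayMs); // back off before retrying
                }
            }
            return -1;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Simulated peer: the first two calls hang, the third answers at once.
        Callable<Void> flaky = () -> {
            if (calls.incrementAndGet() < 3) {
                Thread.sleep(10_000);
            }
            return null;
        };
        int attempts = sendUntilAcked(flaky, 100, 10, 5);
        System.out.println("acknowledged after " + attempts + " attempts");
    }
}
```

The point of the sketch is the contrast with the stack trace in the report,
where the caller sits in an untimed Object.wait and never gets a chance to
retry.
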
> Specifically, in the class DatanodeProtocolClientSideTranslatorPB.java, the
> method createNamenode should use RPC.getProtocolProxy and not RPC.getProxy to
> create the DatanodeProtocolPB object.
> Stack trace of thread stuck in the DataNode after the Standby NN has rebooted:
> Thread 25 (DataNode: [file:///opt/hadoop/data] heartbeating to vmhost6-vm1/10.10.10.151:8020):
> State: WAITING
> Blocked count: 23843
> Waited count: 45676
> Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5
> Stack:
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:485)
> org.apache.hadoop.ipc.Client.call(Client.java:1220)
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
> sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
> sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> java.lang.reflect.Method.invoke(Method.java:597)
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
> sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445)
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
> java.lang.Thread.run(Thread.java:662)
> DataNode RPC to Standby NameNode never times out.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)