[ https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Boudnik updated HDFS-4858:
-------------------------------------

    Status: Open  (was: Patch Available)

> HDFS DataNode to NameNode RPC should timeout
> --------------------------------------------
>
>                 Key: HDFS-4858
>                 URL: https://issues.apache.org/jira/browse/HDFS-4858
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.0.5-alpha, 2.0.4-alpha, 2.1.0-beta, 3.0.0
>         Environment: Redhat/CentOS 6.4 64 bit Linux
>            Reporter: Jagane Sundar
>            Assignee: Jagane Sundar
>            Priority: Minor
>             Fix For: 3.0.0, 2.3.0
>
>         Attachments: HDFS-4858.patch, HDFS-4858.patch
>
>
> The DataNode is configured with ipc.client.ping false and ipc.ping.interval 
> 14000. This configuration means that the IPC Client (DataNode, in this case) 
> should time out in 14000 milliseconds (14 seconds) if the Standby NameNode 
> does not respond to a sendHeartbeat.
> What we observe is this: If the Standby NameNode happens to reboot for any 
> reason, the DataNodes that are heartbeating to this Standby get stuck forever 
> while trying to sendHeartbeat. See Stack trace included below. When the 
> Standby NameNode comes back up, we find that the DataNode never re-registers 
> with the Standby NameNode. Thereafter failover completely fails.
> The desired behavior is that the DataNode's sendHeartbeat should time out in 
> 14 seconds and keep retrying until the Standby NameNode comes back up. When 
> it does, the DataNode should reconnect, re-register, and offer service.
> Specifically, in the class DatanodeProtocolClientSideTranslatorPB.java, the 
> method createNamenode should use RPC.getProtocolProxy and not RPC.getProxy to 
> create the DatanodeProtocolPB object.
> Stack trace of thread stuck in the DataNode after the Standby NN has rebooted:
> Thread 25 (DataNode: [file:///opt/hadoop/data]  heartbeating to vmhost6-vm1/10.10.10.151:8020):
>   State: WAITING
>   Blocked count: 23843
>   Waited count: 45676
>   Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.ipc.Client.call(Client.java:1220)
>     org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>     sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
>     sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>     sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     java.lang.reflect.Method.invoke(Method.java:597)
>     org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>     org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>     sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
>     org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)
>     org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445)
>     org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)
>     org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
>     java.lang.Thread.run(Thread.java:662)
> DataNode RPC to Standby NameNode never times out. 
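
For illustration only, the desired behavior described above (a heartbeat call that gives up after a fixed timeout and is retried until the peer answers) could be sketched with plain JDK classes. This is not the attached patch and uses no Hadoop APIs; the class name HeartbeatTimeoutSketch, both method names, and the simulated unresponsive peer are hypothetical, and the demo uses a 200 ms timeout in place of the 14 seconds mentioned in the report.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: bound each "heartbeat" with a deadline and retry on timeout.
class HeartbeatTimeoutSketch {

    /** Runs one call with a deadline; returns true if it completed in time. */
    public static boolean callWithTimeout(ExecutorService pool, Callable<Void> call,
                                          long timeoutMs) throws Exception {
        Future<Void> pending = pool.submit(call);
        try {
            pending.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            pending.cancel(true); // interrupt the stuck call instead of waiting forever
            return false;
        }
    }

    /** Retries the heartbeat until it is answered; returns the attempt count, or -1. */
    public static int heartbeatUntilAnswered(Callable<Void> heartbeat, long timeoutMs,
                                             int maxAttempts) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                if (callWithTimeout(pool, heartbeat, timeoutMs)) {
                    return attempt;
                }
            }
            return -1; // peer never answered within maxAttempts
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Simulated peer: the first two calls hang (as during a reboot), the third answers.
        Callable<Void> heartbeat = () -> {
            if (calls.incrementAndGet() <= 2) {
                Thread.sleep(60_000);
            }
            return null;
        };
        int attempts = heartbeatUntilAnswered(heartbeat, 200, 5);
        System.out.println("answered after " + attempts + " attempts");
        // prints "answered after 3 attempts"
    }
}
```

Without the deadline in callWithTimeout, the first simulated call would block the caller for the full sleep, which mirrors the thread parked in Object.wait in the stack trace above.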



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
