[ https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13894332#comment-13894332 ]
Wilfred Spiegelenburg commented on HDFS-4858:
---------------------------------------------
This fix will solve the issue for the DataNode, but there is a far more generic
issue in the code: writes do not time out in the Client. The write should use
the same timeout as the read. If rpcTimeout is not set but pingInterval is,
the pingInterval still causes a timeout on read (see around line 600). The
same should happen for writes.
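To make the proposal concrete, here is a minimal sketch of the timeout selection described above, applied to reads and writes alike. The class and method names are hypothetical and are not part of org.apache.hadoop.ipc.Client; the sketch assumes that a value of 0 or less means "no rpcTimeout set".

```java
public class TimeoutChoice {
    // Hypothetical helper for illustration only; not Hadoop code.
    // Assumption: rpcTimeoutMillis <= 0 means no rpcTimeout was configured,
    // in which case the ping interval acts as the timeout.
    static int effectiveTimeoutMillis(int rpcTimeoutMillis, int pingIntervalMillis) {
        return rpcTimeoutMillis > 0 ? rpcTimeoutMillis : pingIntervalMillis;
    }

    public static void main(String[] args) {
        // The same rule is used for both directions, which is the point of the proposal.
        int readTimeout = effectiveTimeoutMillis(0, 14000);
        int writeTimeout = effectiveTimeoutMillis(0, 14000);
        System.out.println("read=" + readTimeout + " ms, write=" + writeTimeout + " ms");
    }
}
```

How the write path would actually enforce this value inside the Client is left open here.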
The Client class is used by more than just the DataNode. It is also used by the
TaskTracker, for example. Not having a timeout on write affects the failover of
the TaskTracker in a high availability scenario. Fixing it once for all users
of the Client would be an easier and quicker solution.
With the proposed change, the default value for the client ping (via
IPC_CLIENT_PING_DEFAULT) also changes from true to false. This will have
flow-on effects too.
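The stack trace in the quoted issue below shows the heartbeat thread parked in Object.wait inside Client.call with no way out. As an illustration of the alternative, here is a minimal sketch of a wait that is bounded by a deadline. BoundedCall is a hypothetical class, not the Hadoop Client$Call, and whether the Client should bound the wait itself or rely on socket timeouts is a separate design question.

```java
import java.util.concurrent.TimeoutException;

public class BoundedCall {
    private boolean done = false;

    // Called by the receiving side when the response arrives.
    public synchronized void complete() {
        done = true;
        notifyAll();
    }

    // Waits for completion, but gives up once timeoutMillis has elapsed.
    public synchronized void await(long timeoutMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!done) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new TimeoutException("no response within " + timeoutMillis + " ms");
            }
            // Loop guards against spurious wakeups; the wait is always bounded.
            wait(remainingNanos / 1_000_000L, (int) (remainingNanos % 1_000_000L));
        }
    }
}
```

A caller could catch the TimeoutException, close the connection, and retry, which matches the retry behavior the issue asks for.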
> HDFS DataNode to NameNode RPC should timeout
> --------------------------------------------
>
> Key: HDFS-4858
> URL: https://issues.apache.org/jira/browse/HDFS-4858
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 3.0.0, 2.1.0-beta, 2.0.4-alpha, 2.0.5-alpha
> Environment: Redhat/CentOS 6.4 64 bit Linux
> Reporter: Jagane Sundar
> Assignee: Konstantin Boudnik
> Priority: Minor
> Fix For: 3.0.0, 2.3.0
>
> Attachments: HDFS-4858.patch, HDFS-4858.patch
>
>
> The DataNode is configured with ipc.client.ping false and ipc.ping.interval
> 14000. This configuration means that the IPC Client (DataNode, in this case)
> should time out after 14000 milliseconds (14 seconds) if the Standby NameNode
> does not respond to a sendHeartbeat.
> What we observe is this: If the Standby NameNode happens to reboot for any
> reason, the DataNodes that are heartbeating to this Standby get stuck forever
> while trying to sendHeartbeat. See Stack trace included below. When the
> Standby NameNode comes back up, we find that the DataNode never re-registers
> with the Standby NameNode. Thereafter failover completely fails.
> The desired behavior is that the DataNode's sendHeartbeat should time out
> after 14 seconds and keep retrying until the Standby NameNode comes back up.
> When it does, the DataNode should reconnect, re-register, and offer service.
> Specifically, in the class DatanodeProtocolClientSideTranslatorPB.java, the
> method createNamenode should use RPC.getProtocolProxy and not RPC.getProxy to
> create the DatanodeProtocolPB object.
> Stack trace of thread stuck in the DataNode after the Standby NN has rebooted:
> Thread 25 (DataNode: [file:///opt/hadoop/data] heartbeating to vmhost6-vm1/10.10.10.151:8020):
> State: WAITING
> Blocked count: 23843
> Waited count: 45676
> Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5
> Stack:
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:485)
> org.apache.hadoop.ipc.Client.call(Client.java:1220)
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
> sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
> sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> java.lang.reflect.Method.invoke(Method.java:597)
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
> sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445)
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
> java.lang.Thread.run(Thread.java:662)
> DataNode RPC to Standby NameNode never times out.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)