[
https://issues.apache.org/jira/browse/HDFS-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568031#comment-14568031
]
Chris Nauroth commented on HDFS-8510:
-------------------------------------
The current situation is problematic for rolling upgrades in deployments that
have set {{ipc.client.connect.max.retries}} and/or
{{ipc.client.connect.retry.interval}} to something higher than the default.
This command is run in situations where the DataNode is expected to be down,
so the expected outcome is a failed connection. The command can therefore
spend a long time in a connection retry loop. In the worst case, a script
that stops and then restarts a DataNode waits so long for the retry loop to
finish that it cannot restart the DataNode within the 30-second deadline
required for OOB ack response handling in the client. Missing this deadline
forces clients into pipeline recoveries, which is suboptimal.
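To see how the retry loop can blow past the 30-second budget, here is a
back-of-the-envelope calculation. The values below are illustrative raised
settings, not the Hadoop defaults (which are 10 retries at 1000 ms):

```shell
# Worst-case time -getDatanodeInfo spends retrying a dead DataNode,
# using example (non-default) values an operator might have configured.
RETRIES=45        # ipc.client.connect.max.retries, raised from default 10
INTERVAL_MS=2000  # ipc.client.connect.retry.interval, raised from 1000 ms
WAIT_S=$(( RETRIES * INTERVAL_MS / 1000 ))
echo "worst-case connect wait: ${WAIT_S}s"  # prints: worst-case connect wait: 90s
```

At 90 seconds, a stop-then-restart script has already missed the 30-second
OOB ack window three times over.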
To minimize surprises for existing deployments, let's set these new timeout
configuration properties to use the same default values as
{{ipc.client.connect.max.retries}} and {{ipc.client.connect.retry.interval}}.
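As a sketch, the hdfs-site.xml overrides might look like the following. The
property names here are hypothetical placeholders for illustration only; the
actual names are decided in the HDFS-8510 patch. The defaults mirror
{{ipc.client.connect.max.retries}} (10) and
{{ipc.client.connect.retry.interval}} (1000 ms):

```xml
<!-- Hypothetical property names; not the names committed by HDFS-8510. -->
<property>
  <name>dfs.client.getdatanodeinfo.connect.max.retries</name>
  <value>10</value>
</property>
<property>
  <name>dfs.client.getdatanodeinfo.connect.retry.interval</name>
  <value>1000</value>
</property>
```

An operator who needs a fast failure for the shutdown check could then lower
only these two values without disturbing the cluster-wide IPC retry policy.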
> Provide different timeout settings for hdfs dfsadmin -getDatanodeInfo.
> ----------------------------------------------------------------------
>
> Key: HDFS-8510
> URL: https://issues.apache.org/jira/browse/HDFS-8510
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: tools
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
>
> During a rolling upgrade, an administrator runs {{hdfs dfsadmin
> -getDatanodeInfo}} to check if a DataNode has stopped. Currently, this
> operation is subject to the RPC connection retries defined in
> {{ipc.client.connect.max.retries}} and {{ipc.client.connect.retry.interval}}.
> This issue proposes adding separate configuration properties to control the
> retries for this operation.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)