[
https://issues.apache.org/jira/browse/HADOOP-15321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409734#comment-16409734
]
Kihwal Lee commented on HADOOP-15321:
-------------------------------------
This IPC behavior was not introduced in 0.23. If one (painfully) traces through
the epic three-project split and the re-merge in the SVN days, it becomes
apparent that the design is considerably older (pre-0.20). Since Hadoop was
originally designed for batch processing, clients were configured to retry for
a long time before giving up. Data transfer, on the other hand, was meant to
move on to other nodes more quickly. So if it was a dfs client, it must have
been {{getReplicaVisibleLength()}}. Although the IPC behavior was not new, 0.23
was the first release with clients making IPC calls against datanodes.
Prior to HDFS-814, which added {{getReplicaVisibleLength()}}, the dfs client
did not make any IPC calls against datanodes. I think it broke the
quick-recovery-for-data-reads design, since IPC connection handling is much
more conservative, having been designed primarily for calls against the
namenode. This change was made to branch-0.21 in December 2009, but was not
really tested in the field until 0.23, which was released two years later. I
think we started seeing problems after upgrading from 1.x (formerly 0.20.205.x)
to 0.23. I do not recall specifically, but it seems HDFS-1330 was an attempt to
address this.
The IPC behavior against the NN has since changed with the introduction of HA.
It seems the error handling in client-to-datanode IPC should be made comparable
to that of data transfer. I thought the default connection timeout was 20
seconds, but even so it is not desirable to retry 45 times. We need a way to
configure datanode IPC separately in clients. Perhaps we can simply use the
parameters for data transfer (block reads) without implicit IPC-level retries.
{{DFSInputStream}} can then retry in the same manner it does for block reads.
We just need to be careful not to leak objects.
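To illustrate the current limitation (not part of any patch here): until a
datanode-specific knob exists, the only lever a client has today is the global
{{ipc.*}} connect-retry settings, which unfortunately also apply to namenode
connections. Below is a minimal sketch, assuming the existing
{{ipc.client.connect.max.retries.on.timeouts}} and {{ipc.client.connect.timeout}}
keys; the values shown (3 retries, 20 s timeout) are illustrative, not a
recommendation from this JIRA.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowRetryClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Give up on an unreachable host after 3 connect timeouts instead of
    // the default 45. Note: this is a global IPC setting, so it also
    // affects connections to the namenode.
    conf.setInt("ipc.client.connect.max.retries.on.timeouts", 3);
    // Per-attempt connect timeout, in milliseconds.
    conf.setInt("ipc.client.connect.timeout", 20000);
    try (FileSystem fs = FileSystem.get(conf)) {
      // Any read that ends up issuing a datanode IPC (e.g. for visible
      // length of an under-construction block) now fails over sooner.
      fs.open(new Path(args[0])).close();
    }
  }
}
{code}
Lowering these cuts the worst case from 45 x 20 s to 3 x 20 s, but only by also
giving up on the namenode sooner, which is exactly why a separate datanode-side
setting (or retry at the {{DFSInputStream}} level) would be better.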
> Reduce the RPC Client max retries on timeouts
> ---------------------------------------------
>
> Key: HADOOP-15321
> URL: https://issues.apache.org/jira/browse/HADOOP-15321
> Project: Hadoop Common
> Issue Type: Improvement
> Components: ipc
> Reporter: Xiao Chen
> Assignee: Xiao Chen
> Priority: Major
>
> Currently, the
> [default|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeysPublic.java#L379]
> number of retries when the IPC client catches a {{ConnectTimeoutException}} is 45.
> This seems unreasonably high.
> Given the IPC client timeout is 60 seconds by default, if a DN host is
> shut down, the client will retry for 45 minutes before aborting. (If the host
> is there but the process is down, it throws a connection refused immediately,
> which is cool.)
> Creating this Jira to discuss whether we can reduce that to a reasonable
> number.