[ https://issues.apache.org/jira/browse/HADOOP-15321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409734#comment-16409734 ]

Kihwal Lee commented on HADOOP-15321:
-------------------------------------

This IPC behavior was not introduced in 0.23. If one (painfully) traces through 
the epic three-project split and the re-merge in the SVN days, it becomes 
apparent that the design is quite a bit older (pre-0.20). As Hadoop was 
originally designed for batch processing, clients were configured to retry for 
a long time before giving up.  Data transfer, on the other hand, is supposed to 
move on to other nodes more quickly.  So if it was a dfs client, it must have 
been {{getReplicaVisibleLength()}}. Although the IPC behavior was not new, 0.23 
was the first release with clients calling IPC against datanodes.

Prior to HDFS-814, which added {{getReplicaVisibleLength()}}, the dfs client 
did not make any IPC calls against datanodes. I think that broke the 
quick-recovery-for-data-reads design, since IPC connection handling is much 
more conservative, having been used primarily against the namenode. The change 
was made to branch-0.21 in December 2009, but was not really tested in the 
field until 0.23, which was released two years later. I think we started seeing 
problems after upgrading from 1.x (formerly 0.20.205.x) to 0.23. I do not 
recall specifically, but it seems HDFS-1330 was an attempt to address this.

The IPC behavior against the NN has since changed with the introduction of HA. 
It seems the error handling for client-to-datanode IPC should be made 
comparable to that of data transfer. I thought the default connection timeout 
was 20 seconds, but even so it is not desirable to retry 45 times. We need a 
way to configure datanode IPC separately in clients. Perhaps we can simply use 
the parameters for data transfer (block reads) without implicit IPC-level 
retries. {{DFSInputStream}} can then retry in the same manner it does for block 
reads. We just need to be careful not to leak objects.
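
As a rough illustration only (not a committed design), a client can already 
cap the generic IPC connect retries through the existing keys in 
{{CommonConfigurationKeysPublic}} ({{ipc.client.connect.max.retries.on.timeouts}} 
and {{ipc.client.connect.timeout}}). The retry count and path in the sketch 
below are made up, and a datanode-specific knob would still need to be added:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowRetryClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Default is 45 attempts; each attempt waits out the connect timeout
    // (ipc.client.connect.timeout, 20000 ms by default), so an unreachable
    // datanode host can stall a reader for many minutes. The value 3 here
    // is purely illustrative.
    conf.setInt("ipc.client.connect.max.retries.on.timeouts", 3);
    conf.setInt("ipc.client.connect.timeout", 20000);

    // A read that ends up issuing getReplicaVisibleLength() against a dead
    // datanode host now gives up after a few attempts instead of 45,
    // leaving any further retry or failover decision to the caller.
    try (FSDataInputStream in =
        FileSystem.get(conf).open(new Path("/tmp/example"))) {
      in.read();
    }
  }
}
{code}

Note this narrows the generic IPC settings for the whole client, which is 
exactly why separate datanode-only parameters (or reusing the data transfer 
ones) look preferable.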

> Reduce the RPC Client max retries on timeouts
> ---------------------------------------------
>
>                 Key: HADOOP-15321
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15321
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>            Reporter: Xiao Chen
>            Assignee: Xiao Chen
>            Priority: Major
>
> Currently, the 
> [default|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CommonConfigurationKeysPublic.java#L379]
>  number of retries when the IPC client catches a {{ConnectTimeoutException}} is 45. 
> This seems unreasonably high.
> Given that the IPC client timeout is 60 seconds by default, if a DN host is 
> shut down the client will retry for 45 minutes before aborting. (If the host 
> is up but the process is down, it throws connection refused immediately, 
> which is cool.)
> Creating this Jira to discuss whether we can reduce that to a reasonable 
> number.
