[ https://issues.apache.org/jira/browse/HDFS-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852875#action_12852875 ]

Todd Lipcon commented on HDFS-1075:
-----------------------------------

Additionally, there are some problems with the math of how timeouts are added 
together during write pipeline construction. In particular, each connector 
(client or intermediate DN) sets the timeoutValue to socketTimeout + 
READ_TIMEOUT_EXTENSION * targets.length. This budgets READ_TIMEOUT_EXTENSION 
millis for each downstream stage of the pipeline to cascade its response. 
However, this accounting ends up being wrong if the connection setup time 
exceeds the timeout extension. Consider the following somewhat contrived 
sequence:

Assume we've set the read timeout to 60 seconds and the extension to 3 seconds:

|time|client|DN A|DN B|
|0|connect|-|-|
|0.1|connected|-|-|
|0.2|-|connect B|-|
|20.2|-|connected, send op|-|
|70.2|-|-|respond to op|
|70.3|-|recv response|-|
|70.4|recv response|-|-|

In this case, the client will time out at roughly time 66 and decide that DN A 
is the bad one, rather than DN B, even though DN B is the one that caused all 
the delay. The case is obviously somewhat contrived, but if timeouts are set 
lower, this problem can happen in practice.
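
Working through the numbers: the client, with two downstream targets, allows 
60 + 3 * 2 = 66 seconds for the setup ack, while DN A, with one target, allows 
60 + 3 * 1 = 63 seconds. But DN A's 63-second clock only starts when it sends 
the op at t=20.2, and B's response arrives 50 seconds later, well within that 
budget. So from A's and B's point of view nothing ever timed out; only the 
client's overall clock, which also had to absorb the 20-second connect between 
A and B, expired.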

Essentially, the issue is that the timeout is a budget for the total 
operation, whereas we apply it afresh to each individual step of the 
operation. To fix this, each connector should check the current timestamp 
before calling NetUtils.connect and account for how much of the total 
allotted time has been used up. Then, before reading the mirror status and 
firstBadLink, it should drop the socket timeout down to the remaining 
allotted time.
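
In sketch form, that accounting might look like the following (a minimal 
sketch only; the class and constant names are illustrative, not the actual 
DFSClient/DataXceiver code):

{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;
import org.apache.hadoop.net.NetUtils;

class PipelineConnector {
  // Illustrative stand-ins for the real configuration values.
  static final int SOCKET_TIMEOUT = 60 * 1000;        // base read timeout, ms
  static final int READ_TIMEOUT_EXTENSION = 3 * 1000; // per-hop extension, ms

  Socket connectWithBudget(InetSocketAddress target, int numTargets)
      throws IOException {
    long budget = SOCKET_TIMEOUT + (long) READ_TIMEOUT_EXTENSION * numTargets;
    long deadline = System.currentTimeMillis() + budget;

    Socket sock = new Socket();
    // The connect itself draws from the same overall budget.
    NetUtils.connect(sock, target, (int) budget);

    // ... send the pipeline setup op to the mirror here ...

    // Before blocking on the mirror status and firstBadLink, charge the
    // time already spent (e.g. a slow connect) against the budget so it
    // is not granted a second time while reading.
    long remaining = deadline - System.currentTimeMillis();
    if (remaining <= 0) {
      sock.close();
      throw new SocketTimeoutException(
          "pipeline setup exhausted its " + budget + " ms budget");
    }
    sock.setSoTimeout((int) remaining);
    return sock;
  }
}
{code}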

> Separately configure connect timeouts from read timeouts in data path
> ---------------------------------------------------------------------
>
>                 Key: HDFS-1075
>                 URL: https://issues.apache.org/jira/browse/HDFS-1075
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, hdfs client
>            Reporter: Todd Lipcon
>
> The timeout configurations in the write pipeline overload the read timeout to 
> also include a connect timeout. In my experience, if a node is down it can 
> take many seconds to get back an exception on connect, whereas if it is up it 
> will accept almost immediately, even if heavily loaded (the kernel listen 
> backlog picks it up very fast). So in the interest of faster dead-node 
> detection from the writer's perspective, the connect timeout should be 
> configured separately, usually to a much lower value than the read timeout.
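
For illustration, the split being proposed here maps onto the two timeouts 
the plain JDK socket API already distinguishes (the class name, host, port, 
and timeout values below are hypothetical):

{code}
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectVsReadTimeout {
  public static void main(String[] args) throws Exception {
    int connectTimeout = 5 * 1000;  // hypothetical: fail fast on dead nodes
    int readTimeout = 60 * 1000;    // hypothetical: patient once connected

    Socket sock = new Socket();
    // connect() errors out quickly if the host is down or unreachable,
    sock.connect(new InetSocketAddress("datanode.example.com", 50010),
        connectTimeout);
    // while reads on the established connection get the longer budget.
    sock.setSoTimeout(readTimeout);
    sock.close();
  }
}
{code}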

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
