[ 
https://issues.apache.org/jira/browse/HDFS-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364604#comment-14364604
 ] 

Todd Lipcon commented on HDFS-5522:
-----------------------------------

I know this has been closed for a while, but wanted to get some clarification:

{quote}
If I/O hangs for a long time, network read/write may timeout first and the peer 
may close the connection. Although the error was caused by a faulty local disk, 
disk check is not being carried out in this case
{quote}

This JIRA seems to apply a rather heavy hammer to a specific case that could be 
better identified in another way. With this patch applied, it looks like DNs 
run checkDiskError whenever any other node experiences an issue. So, if one 
node is down (e.g., due to a rolling restart or a crash), all of the other 
nodes are very soon running checkDiskError for no particularly good reason. 
Coupled with HDFS-7489, this failure can also cascade.

Can you describe in more detail the scenario you were facing that inspired this 
JIRA? Would it not make more sense to look for the underlying symptom (perhaps 
by adding a timer around the I/O) and run checkDiskError only in the specific 
scenarios we're looking for, rather than on all network errors?
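
To make the timer idea concrete, here is a rough sketch of what timing the local I/O might look like. It is not taken from the DataNode code: the class name TimedDiskIo, the DiskOp interface, and the diskChecker callback (a stand-in for whatever would schedule checkDiskError) are all hypothetical, and the threshold value is an assumption that would need tuning.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/**
 * Illustrative sketch only: wraps a local disk operation with a timer and
 * requests a disk check when the operation fails or is unusually slow.
 * None of these names exist in HDFS; they are assumptions for the example.
 */
public class TimedDiskIo {

    /** Hypothetical stand-in for a local disk operation that may throw. */
    @FunctionalInterface
    public interface DiskOp<T> {
        T run() throws IOException;
    }

    private final long slowThresholdNanos;
    // Stand-in for "schedule checkDiskError()"; assumed to be cheap to call.
    private final Runnable diskChecker;

    public TimedDiskIo(long slowThresholdMillis, Runnable diskChecker) {
        this.slowThresholdNanos = TimeUnit.MILLISECONDS.toNanos(slowThresholdMillis);
        this.diskChecker = diskChecker;
    }

    /**
     * Runs the disk operation and triggers the disk check only if the
     * operation itself threw or took at least the configured threshold.
     * Network errors never pass through here, so they cannot trigger it.
     */
    public <T> T timed(DiskOp<T> op) throws IOException {
        long start = System.nanoTime();
        boolean failed = true;
        try {
            T result = op.run();
            failed = false;
            return result;
        } finally {
            long elapsed = System.nanoTime() - start;
            if (failed || elapsed >= slowThresholdNanos) {
                diskChecker.run();
            }
        }
    }
}
```

With something along these lines, a peer closing its connection would not by itself cause a disk check; only slow or failing local I/O would. One limitation of this sketch: the elapsed time is only evaluated once the call returns, so an I/O that never returns would need a separate watchdog.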

> Datanode disk error check may be incorrectly skipped
> ----------------------------------------------------
>
>                 Key: HDFS-5522
>                 URL: https://issues.apache.org/jira/browse/HDFS-5522
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.23.9, 2.2.0
>            Reporter: Kihwal Lee
>            Assignee: Rushabh S Shah
>             Fix For: 2.5.0
>
>         Attachments: HDFS-5522-v2.patch, HDFS-5522-v3.patch, HDFS-5522.patch
>
>
> After HDFS-4581 and HDFS-4699, {{checkDiskError()}} is not called when 
> network errors occur during processing data node requests.  This appears to 
> create problems when a disk is having problems, but not failing I/O soon. 
> If I/O hangs for a long time, network read/write may timeout first and the 
> peer may close the connection. Although the error was caused by a faulty 
> local disk, disk check is not being carried out in this case. 



