[ https://issues.apache.org/jira/browse/HDFS-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150220#comment-15150220 ]
Vinayakumar B commented on HDFS-5522:
-------------------------------------
bq. So, if one node is down (eg due to a rolling restart or a crash) all of the
other nodes are very soon running checkDiskError for no particularly good
reason. Coupled with HDFS-7489, this failure can also cascade
Yes, the same thing has been experienced in a customer's cluster.
Due to a network issue on some nodes, all the other datanodes connected to
them in pipelines started disk checks. And without HDFS-8845 (2.7.2), disk I/O
on all of those datanodes hit 100%.
By the time the first round of disk checks was done, some other exception had
already requested a disk check again. This continued for more than 40 hours,
slowing down every other application.
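
To illustrate the kind of guard that would keep back-to-back exceptions from
re-triggering disk checks continuously, here is a minimal sketch. It is not
the DataNode code and not what any of the patches do; the class name
DiskCheckThrottle, the method tryAcquire and the interval parameter are all
hypothetical. It only shows the idea of admitting at most one check per
minimum interval, no matter how many exceptions ask for one.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper, not part of Hadoop: admits at most one disk check
 * per minimum interval, however many I/O exceptions request one.
 */
final class DiskCheckThrottle {
    private final long minIntervalMs;
    // Time of the last admitted check; Long.MIN_VALUE means "never".
    private final AtomicLong lastCheckMs = new AtomicLong(Long.MIN_VALUE);

    DiskCheckThrottle(long minIntervalMs) {
        this.minIntervalMs = minIntervalMs;
    }

    /** Returns true if the caller should run a disk check now. */
    boolean tryAcquire(long nowMs) {
        while (true) {
            long last = lastCheckMs.get();
            if (last != Long.MIN_VALUE && nowMs - last < minIntervalMs) {
                return false; // a check was admitted recently; drop this request
            }
            if (lastCheckMs.compareAndSet(last, nowMs)) {
                return true; // this caller won the right to run the check
            }
            // Another thread updated the timestamp first; re-evaluate.
        }
    }
}
```

A caller would pass a monotonic clock reading in milliseconds and run the
check only when tryAcquire returns true. This bounds how often checks run; it
does not decide whether a given exception was really caused by the local disk.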
> Datanode disk error check may be incorrectly skipped
> ----------------------------------------------------
>
> Key: HDFS-5522
> URL: https://issues.apache.org/jira/browse/HDFS-5522
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 0.23.9, 2.2.0
> Reporter: Kihwal Lee
> Assignee: Rushabh S Shah
> Fix For: 2.5.0
>
> Attachments: HDFS-5522-v2.patch, HDFS-5522-v3.patch, HDFS-5522.patch
>
>
> After HDFS-4581 and HDFS-4699, {{checkDiskError()}} is not called when
> network errors occur while processing datanode requests. This appears to
> cause trouble when a disk is having problems but does not fail I/O quickly.
> If I/O hangs for a long time, the network read/write may time out first and
> the peer may close the connection. Although the error was caused by a faulty
> local disk, the disk check is not carried out in this case.