[
https://issues.apache.org/jira/browse/HDFS-15744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268996#comment-17268996
]
Haibin Huang commented on HDFS-15744:
-------------------------------------
[~ayushtkn] [~aajisaka] [~elgoiri] [~hexiaoqiao] would you mind taking a look
at this? We use this approach to detect slow disks at our company, and its
accuracy in finding bad disks is over 90%.
> Use cumulative counting way to improve the accuracy of slow disk detection
> --------------------------------------------------------------------------
>
> Key: HDFS-15744
> URL: https://issues.apache.org/jira/browse/HDFS-15744
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Haibin Huang
> Assignee: Haibin Huang
> Priority: Major
> Attachments: HDFS-15744-001.patch, image-2020-12-22-11-37-14-734.png,
> image-2020-12-22-11-37-35-280.png, image-2020-12-22-11-46-48-817.png
>
>
> HDFS has supported DataNode disk outlier detection since
> [HDFS-11461|https://issues.apache.org/jira/browse/HDFS-11461], and we can use
> it to find slow disks via SlowDiskReport
> ([HDFS-11551|https://issues.apache.org/jira/browse/HDFS-11551]). However, I
> found that the slow disk information may not be accurate enough in practice,
> because a large number of short-term writes can lead to miscalculation. Here
> is an example: this disk is healthy, but when it receives a lot of writes
> within a few minutes, its write I/O does get slow and it is considered a slow
> disk. The disk is only slow for a few minutes, but SlowDiskReport keeps it
> until the information becomes invalid. This scenario confuses us, since we
> want to use SlowDiskReport to detect the really bad disks.
> !image-2020-12-22-11-37-14-734.png!
> !image-2020-12-22-11-37-35-280.png!
> To improve the detection accuracy, we use cumulative counting to detect slow
> disks. If, within the reportValidityMs interval, a disk is considered an
> outlier in more than 50% of the detections, then it should be a real bad disk.
> Here is an example: if reportValidityMs is one hour and the detection
> interval is five minutes, there will be 12 disk outlier detections in one
> hour. If a disk is considered an outlier more than 6 times, it should be a
> real bad disk. We use this approach to detect bad disks in our cluster, and
> it reaches over 90% accuracy.
> !image-2020-12-22-11-46-48-817.png!
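
The cumulative counting rule described in the issue can be sketched roughly as follows. This is only an illustration of the idea and is not taken from HDFS-15744-001.patch: the class name CumulativeSlowDiskCounter, its methods, and the disk paths are hypothetical, and the sketch assumes each detection round yields the set of disks flagged as outliers in that round.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper illustrating cumulative counting; not the patch's code. */
final class CumulativeSlowDiskCounter {
  // Number of detection rounds that fit into one report validity window.
  private final int windowSize;
  // Outlier sets of the most recent rounds, oldest first.
  private final Deque<Set<String>> rounds = new ArrayDeque<>();
  // How many rounds in the current window flagged each disk.
  private final Map<String, Integer> outlierCounts = new HashMap<>();

  CumulativeSlowDiskCounter(long reportValidityMs, long detectionIntervalMs) {
    if (detectionIntervalMs <= 0 || reportValidityMs < detectionIntervalMs) {
      throw new IllegalArgumentException("window must hold at least one round");
    }
    this.windowSize = (int) (reportValidityMs / detectionIntervalMs);
  }

  /** Records one detection round and evicts the round that left the window. */
  void recordRound(Set<String> outlierDisks) {
    Set<String> snapshot = new HashSet<>(outlierDisks);
    rounds.addLast(snapshot);
    for (String disk : snapshot) {
      outlierCounts.merge(disk, 1, Integer::sum);
    }
    if (rounds.size() > windowSize) {
      for (String disk : rounds.removeFirst()) {
        // Returning null removes the entry once its count drops to zero.
        outlierCounts.computeIfPresent(disk, (k, v) -> v == 1 ? null : v - 1);
      }
    }
  }

  /** Disks flagged in more than 50% of the rounds of a full window. */
  Set<String> slowDisks() {
    Set<String> result = new HashSet<>();
    for (Map.Entry<String, Integer> e : outlierCounts.entrySet()) {
      if (e.getValue() * 2 > windowSize) {
        result.add(e.getKey());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // One hour validity, five minute interval: 12 rounds per window.
    CumulativeSlowDiskCounter counter =
        new CumulativeSlowDiskCounter(3_600_000L, 300_000L);
    for (int round = 0; round < 12; round++) {
      Set<String> outliers = new HashSet<>();
      if (round < 7) {
        outliers.add("/data1"); // flagged in 7 of 12 rounds
      }
      if (round < 3) {
        outliers.add("/data2"); // short write burst, flagged in 3 rounds
      }
      counter.recordRound(outliers);
    }
    System.out.println(counter.slowDisks());
  }
}
```

In this sketch the threshold is always measured against the full window of 12 rounds, so a disk cannot be reported before it has been flagged at least 7 times, and a short write burst such as the one on the hypothetical /data2 never reaches the threshold.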
--
This message was sent by Atlassian Jira
(v8.3.4#803005)