[jira] [Commented] (HDFS-15744) Use cumulative counting way to improve the accuracy of slow disk detection

Haibin Huang (Jira) Wed, 20 Jan 2021 18:49:04 -0800


    [ 
https://issues.apache.org/jira/browse/HDFS-15744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268996#comment-17268996
 ]


Haibin Huang commented on HDFS-15744:
-------------------------------------

[~ayushtkn] [~aajisaka] [~elgoiri] [~hexiaoqiao] would you mind taking a look at 
this? We use this approach to detect slow disks at our company, and it finds 
bad disks with over 90% accuracy.

> Use cumulative counting way to improve the accuracy of slow disk detection
> --------------------------------------------------------------------------
>
>                 Key: HDFS-15744
>                 URL: https://issues.apache.org/jira/browse/HDFS-15744
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haibin Huang
>            Assignee: Haibin Huang
>            Priority: Major
>         Attachments: HDFS-15744-001.patch, image-2020-12-22-11-37-14-734.png, 
> image-2020-12-22-11-37-35-280.png, image-2020-12-22-11-46-48-817.png
>
>
> HDFS has supported DataNode disk outlier detection since 
> [HDFS-11461|https://issues.apache.org/jira/browse/HDFS-11461], and we can use it 
> to find slow disks via 
> SlowDiskReport ([HDFS-11551|https://issues.apache.org/jira/browse/HDFS-11551]). However, 
> I found that the slow disk information may not be accurate enough in practice, 
> because a large number of short-term writes can lead to miscalculation. Here 
> is an example: a disk is healthy, but when it receives a lot of writes within a 
> few minutes, its write I/O does get slow, and it is considered a slow 
> disk. The disk is slow for only a few minutes, but SlowDiskReport keeps it until 
> the information expires. This scenario confuses us, since we want to 
> use SlowDiskReport to detect the really bad disks.
> !image-2020-12-22-11-37-14-734.png!
> !image-2020-12-22-11-37-35-280.png!
> To improve the detection accuracy, we use cumulative counting to detect 
> slow disks. If, within the reportValidityMs interval, a disk is considered an 
> outlier in more than 50% of the detections, then it should be a really bad disk.
> Here is an example: if reportValidityMs is one hour and the detection interval 
> is five minutes, there will be 12 disk outlier detections in one hour. If 
> a disk is considered an outlier more than 6 times, it should be a really bad 
> disk. We use this approach to detect bad disks in our cluster, and it reaches 
> over 90% accuracy.
> !image-2020-12-22-11-46-48-817.png!
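
The cumulative counting rule described above can be sketched roughly as follows. This is only an illustration of the idea, not the code in HDFS-15744-001.patch: the class name CumulativeSlowDiskCounter and the methods recordRound and isSlow are hypothetical and do not exist in Hadoop. The sketch assumes one outlier verdict per disk per detection round and a sliding window of length reportValidityMs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not a Hadoop class): counts outlier verdicts per disk in a sliding window. */
class CumulativeSlowDiskCounter {

  /** One detection round: when it ran and which disks it flagged as outliers. */
  private static final class Round {
    final long timeMs;
    final Set<String> outliers;

    Round(long timeMs, Set<String> outliers) {
      this.timeMs = timeMs;
      this.outliers = new HashSet<>(outliers); // defensive copy
    }
  }

  private final long reportValidityMs;
  private final long roundsPerWindow;
  private final Deque<Round> rounds = new ArrayDeque<>();
  private final Map<String, Integer> outlierCounts = new HashMap<>();

  CumulativeSlowDiskCounter(long reportValidityMs, long detectionIntervalMs) {
    if (detectionIntervalMs <= 0 || reportValidityMs < detectionIntervalMs) {
      throw new IllegalArgumentException("window must cover at least one detection interval");
    }
    this.reportValidityMs = reportValidityMs;
    // Expected number of detection rounds per window, e.g. 1 hour / 5 minutes = 12.
    this.roundsPerWindow = reportValidityMs / detectionIntervalMs;
  }

  /** Records the result of one outlier detection round taken at nowMs. */
  void recordRound(long nowMs, Set<String> outliers) {
    Round round = new Round(nowMs, outliers);
    rounds.addLast(round);
    for (String disk : round.outliers) {
      outlierCounts.merge(disk, 1, Integer::sum);
    }
    evictExpired(nowMs);
  }

  /** A disk counts as slow only if it was an outlier in more than 50% of the window's rounds. */
  boolean isSlow(String disk, long nowMs) {
    evictExpired(nowMs);
    int count = outlierCounts.getOrDefault(disk, 0);
    return 2L * count > roundsPerWindow;
  }

  /** Drops rounds that are reportValidityMs old or older and decrements their counts. */
  private void evictExpired(long nowMs) {
    while (!rounds.isEmpty() && nowMs - rounds.peekFirst().timeMs >= reportValidityMs) {
      Round old = rounds.removeFirst();
      for (String disk : old.outliers) {
        // Returning null removes the entry once its count reaches zero.
        outlierCounts.computeIfPresent(disk, (k, v) -> v > 1 ? v - 1 : null);
      }
    }
  }
}
```

With reportValidityMs set to one hour and a five-minute detection interval, a disk flagged in 7 of the 12 rounds is reported as slow, while a disk flagged in exactly 6 rounds is not. A short burst of writes that trips only one or two rounds therefore no longer marks a healthy disk as slow.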



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
