[ https://issues.apache.org/jira/browse/HDFS-15744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Haibin Huang updated HDFS-15744:
--------------------------------
    Attachment: HDFS-15744-001.patch

> Use cumulative counting way to improve the accuracy of slow disk detection
> --------------------------------------------------------------------------
>
>                 Key: HDFS-15744
>                 URL: https://issues.apache.org/jira/browse/HDFS-15744
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haibin Huang
>            Assignee: Haibin Huang
>            Priority: Major
>         Attachments: HDFS-15744-001.patch, image-2020-12-22-11-37-14-734.png,
> image-2020-12-22-11-37-35-280.png, image-2020-12-22-11-46-48-817.png
>
>
> HDFS has supported DataNode disk outlier detection since
> [HDFS-11461|https://issues.apache.org/jira/browse/HDFS-11461], and we can use
> it to find slow disks via
> SlowDiskReport ([HDFS-11551|https://issues.apache.org/jira/browse/HDFS-11551]).
> However, I found that the slow disk information may not be accurate enough in
> practice, because a large number of short-term writes can lead to
> miscalculation. For example, when a healthy disk receives a lot of writes
> within a few minutes, its write I/O does get slow, and the disk is reported as
> a slow disk. The disk is slow for only a few minutes, but SlowDiskReport keeps
> it until the information becomes invalid. This scenario confuses us, since we
> want to use SlowDiskReport to detect disks that are really bad.
> !image-2020-12-22-11-37-14-734.png!
> !image-2020-12-22-11-37-35-280.png!
> To improve the detection accuracy, we use a cumulative counting approach to
> detect slow disks. If, within the reportValidityMs interval, a disk is
> considered an outlier in more than 50% of the detections, then it should be a
> really bad disk.
> Here is an example: if reportValidityMs is one hour and the detection interval
> is five minutes, there will be 12 disk outlier detections in one hour. If a
> disk is considered an outlier more than 6 times, it should be a really bad
> disk. We use this approach to detect bad disks in our cluster, and it can
> reach over 90% accuracy.
> !image-2020-12-22-11-46-48-817.png!
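
The cumulative counting rule described in the issue can be sketched roughly as follows. This is only an illustration of the idea, not the code in HDFS-15744-001.patch: the class name CumulativeSlowDiskCounter and its methods are hypothetical, and it assumes the outlier detector reports one yes/no result per disk for each detection round. Only reportValidityMs, the detection interval, and the "more than 50%" threshold come from the description.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of cumulative counting for slow disk detection.
 * Names and structure are assumptions made for illustration only.
 */
class CumulativeSlowDiskCounter {
  // Number of detection rounds that fit into one report validity window,
  // e.g. 1 hour / 5 minutes = 12 rounds.
  private final int roundsPerWindow;

  // Per disk: the most recent detection results (true = flagged as outlier).
  private final Map<String, Deque<Boolean>> history = new HashMap<>();

  CumulativeSlowDiskCounter(long reportValidityMs, long detectionIntervalMs) {
    if (detectionIntervalMs <= 0 || reportValidityMs < detectionIntervalMs) {
      throw new IllegalArgumentException("invalid window or interval");
    }
    this.roundsPerWindow = (int) (reportValidityMs / detectionIntervalMs);
  }

  /** Records the result of one detection round for one disk. */
  void record(String disk, boolean flaggedAsOutlier) {
    Deque<Boolean> votes = history.computeIfAbsent(disk, d -> new ArrayDeque<>());
    votes.addLast(flaggedAsOutlier);
    // Keep only the rounds that still fall inside the validity window.
    if (votes.size() > roundsPerWindow) {
      votes.removeFirst();
    }
  }

  /** Returns true if the disk was an outlier in more than 50% of the window. */
  boolean isSlow(String disk) {
    Deque<Boolean> votes = history.get(disk);
    if (votes == null) {
      return false;
    }
    int outliers = 0;
    for (boolean vote : votes) {
      if (vote) {
        outliers++;
      }
    }
    // Compare against the full window size, so a disk with only a few
    // recorded rounds is not reported early.
    return outliers * 2 > roundsPerWindow;
  }
}
```

With reportValidityMs set to one hour and a five minute detection interval, this sketch keeps 12 rounds per disk, so a disk flagged in 6 rounds is not reported and a disk flagged in 7 rounds is. Comparing against the full window rather than the rounds seen so far is a design choice of the sketch, meant to avoid reporting a disk after a single short burst of writes.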