[jira] [Comment Edited] (HDDS-4666) Handling disk issues in Datanodes

Mark Gui (Jira) Tue, 23 Mar 2021 07:04:06 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307093#comment-17307093
 ]


Mark Gui edited comment on HDDS-4666 at 3/23/21, 2:03 PM:
----------------------------------------------------------

I would like to raise a related problem here. Currently the MutableVolumeSet 
has a periodic disk checker which runs every 15 minutes (fixed, not 
configurable), so it may be several minutes before the datanode gets a chance 
to detect bad disks.

As far as I can tell, the datanode does not handle failures on the write path, 
so we may have to introduce on-demand disk checks when an I/O failure is hit 
(due to bad disks, etc.).
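
To make the on-demand idea concrete, here is a minimal sketch of what such a 
check could look like. This is only an illustration under my own assumptions: 
the class OnDemandVolumeChecker, its methods and the minimum-gap throttle are 
all hypothetical names, not existing Ozone or HDFS APIs. The write path would 
call onIoFailure() when it hits an I/O error, and the checker probes the 
volume at most once per gap instead of waiting for the periodic run.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names exist in Ozone.
class OnDemandVolumeChecker {
    // Minimum time between two probes of the same volume, so that a burst
    // of failed writes does not turn into a burst of disk checks.
    private final long minGapMillis;
    private final Map<Path, Long> lastCheck = new HashMap<>();
    private final Set<Path> failedVolumes = new HashSet<>();

    OnDemandVolumeChecker(long minGapMillis) {
        this.minGapMillis = minGapMillis;
    }

    // Intended to be called from the write path when an I/O error is hit.
    synchronized void onIoFailure(Path volumeRoot) {
        long now = System.currentTimeMillis();
        Long previous = lastCheck.get(volumeRoot);
        if (previous != null && now - previous < minGapMillis) {
            return; // this volume was probed recently, skip
        }
        lastCheck.put(volumeRoot, now);
        if (!probe(volumeRoot)) {
            failedVolumes.add(volumeRoot);
        }
    }

    synchronized boolean isFailed(Path volumeRoot) {
        return failedVolumes.contains(volumeRoot);
    }

    // Cheap health probe: the volume root must be a directory in which a
    // small file can be created, written and deleted.
    static boolean probe(Path volumeRoot) {
        try {
            if (!Files.isDirectory(volumeRoot)) {
                return false;
            }
            Path tmp = Files.createTempFile(volumeRoot, "disk-check-", ".tmp");
            Files.write(tmp, new byte[] {1});
            Files.delete(tmp);
            return true;
        } catch (IOException | SecurityException e) {
            return false;
        }
    }
}
```

A real implementation would probably run the probe asynchronously and report 
the failed volume to SCM rather than only recording it locally; the sketch 
leaves that out.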

By the way, HDFS has only a lazy on-demand disk checker:

[https://blog.cloudera.com/hdfs-datanode-scanners-and-disk-checker-explained/]


> Handling disk issues in Datanodes
> ---------------------------------
>
>                 Key: HDDS-4666
>                 URL: https://issues.apache.org/jira/browse/HDDS-4666
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode, SCM
>            Reporter: Shashikant Banerjee
>            Assignee: Shashikant Banerjee
>            Priority: Major
>
> Currently, there is no notion of reserved space on datanodes as there is on 
> HDFS datanodes. Also, a datanode low on disk capacity continues to 
> participate in pipeline allocation and keeps receiving write requests; 
> these requests fail and can end up in a retry loop in the client.
> Similarly, Ratis log disks are currently not accounted for by the disk 
> checker. Once a Ratis disk gets full, existing pipelines cannot be closed, 
> because closing a pipeline involves taking a Ratis snapshot, which will not 
> succeed if the Ratis disk is full. New pipelines likewise cannot function 
> on such disks; they end up failing write requests and add to the client 
> retry chain.
> Nodes low on disk capacity should also not be chosen as targets for 
> container re-replication.
> The goal of this Jira is to address disk-related issues on datanodes 
> holistically.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
