[
https://issues.apache.org/jira/browse/HDDS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307093#comment-17307093
]
Mark Gui edited comment on HDDS-4666 at 3/23/21, 2:03 PM:
----------------------------------------------------------
I'd like to raise a related problem here. Currently the MutableVolumeSet has a
periodic disk checker that runs every 15 minutes (fixed, not configurable), so it
can take several minutes before the datanode gets a chance to detect bad disks.
As far as I can tell, the datanode does not handle failures on the write path, so
we may have to introduce on-demand disk checks when an I/O failure is hit (due to
bad disks, etc.).
By the way, HDFS has only a lazy, on-demand disk checker:
[https://blog.cloudera.com/hdfs-datanode-scanners-and-disk-checker-explained/]
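To make the on-demand idea concrete, here is a rough sketch in plain Java. It is
not Ozone code: the class OnDemandVolumeChecker and its methods are hypothetical
names made up for illustration, and the probe (create, write and delete a small
temp file) is only one possible health check. The point is that a failed write
triggers a check of that volume right away instead of waiting for the next
periodic run.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; this is not the MutableVolumeSet API. */
final class OnDemandVolumeChecker {
  // Volumes that failed an on-demand check (assumed bookkeeping).
  private final Set<Path> failedVolumes = ConcurrentHashMap.newKeySet();

  /** Probes a volume root by creating, writing and deleting a small file. */
  static boolean isHealthy(Path volumeRoot) {
    try {
      Path probe = Files.createTempFile(volumeRoot, "disk-check-", ".tmp");
      try {
        Files.write(probe, new byte[] {1});
      } finally {
        Files.deleteIfExists(probe);
      }
      return true;
    } catch (IOException | SecurityException e) {
      return false;
    }
  }

  /** Writes data; on an I/O failure, checks the volume immediately. */
  void write(Path volumeRoot, String fileName, byte[] data)
      throws IOException {
    try {
      Files.write(volumeRoot.resolve(fileName), data);
    } catch (IOException e) {
      // On-demand check: do not wait for the periodic checker.
      if (!isHealthy(volumeRoot)) {
        failedVolumes.add(volumeRoot);
      }
      throw e; // the caller still sees the original failure
    }
  }

  /** Returns true if an on-demand check has marked this volume as failed. */
  boolean isFailed(Path volumeRoot) {
    return failedVolumes.contains(volumeRoot);
  }
}
```

A real implementation would presumably run the check asynchronously and report
the failed volume to SCM; both are left out here to keep the sketch short.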
> Handling disk issues in Datanodes
> ---------------------------------
>
> Key: HDDS-4666
> URL: https://issues.apache.org/jira/browse/HDDS-4666
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode, SCM
> Reporter: Shashikant Banerjee
> Assignee: Shashikant Banerjee
> Priority: Major
>
> Currently, there is no notion of reserved space on datanodes as there is on
> HDFS datanodes. Similarly, a datanode low on disk capacity continues to
> participate in pipeline allocation and keeps receiving write requests;
> these requests fail and can end up in a retry loop in the client.
> Ratis log disks are also currently not accounted for by the disk checker.
> Once a Ratis disk gets full, existing pipelines cannot be closed, because
> closing a pipeline involves taking a Ratis snapshot, which will not succeed
> if the Ratis disk is full. Likewise, new pipelines cannot function on such
> disks; they will fail write requests and add to the client retry chain.
> Nodes low on disk capacity should also not be chosen as targets for
> container re-replication.
> The goal of this Jira is to address disk-related issues on datanodes
> holistically.
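
As a rough illustration of the reserved-space point in the description, the
sketch below filters out nodes whose usable space, after subtracting a reserved
amount, cannot hold a request. NodeUsage and SpaceAwareChooser are hypothetical
names, not Ozone or SCM classes, and the numbers are made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-node disk usage report (all values in bytes). */
final class NodeUsage {
  final String name;
  final long capacityBytes;
  final long usedBytes;
  final long reservedBytes; // space held back, never offered to writers

  NodeUsage(String name, long capacityBytes, long usedBytes,
      long reservedBytes) {
    this.name = name;
    this.capacityBytes = capacityBytes;
    this.usedBytes = usedBytes;
    this.reservedBytes = reservedBytes;
  }

  /** Space usable for new data once the reserved amount is set aside. */
  long availableBytes() {
    return Math.max(0L, capacityBytes - usedBytes - reservedBytes);
  }
}

/** Hypothetical filter applied before pipeline or replication placement. */
final class SpaceAwareChooser {
  static List<NodeUsage> eligible(List<NodeUsage> nodes, long requiredBytes) {
    List<NodeUsage> result = new ArrayList<>();
    for (NodeUsage node : nodes) {
      if (node.availableBytes() >= requiredBytes) {
        result.add(node);
      }
    }
    return result;
  }
}
```

For example, a node with capacity 100, used 50 and reserved 20 has 30 bytes
available and stays eligible for a 25-byte request, while a node with used 90
and the same reserve has nothing available and is skipped.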