[
https://issues.apache.org/jira/browse/HDDS-10461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siddhant Sangwan updated HDDS-10461:
------------------------------------
Description:
Many users test out Ozone in small clusters of 15 Datanodes or fewer. If a 15 DN
cluster has some EC 10-4 containers, for example, then it's not possible to
decommission more than 1 Datanode, because EC 10-4 requires at least 14
Datanodes (10 data + 4 parity replicas, each on a different Datanode). If
someone tries to decommission two Datanodes, the decommissioning process keeps
looping, checking every 30 seconds whether the two DNs can be taken offline. It
never fails, even though it's clearly not possible to take the Datanodes
offline.
This epic aims to implement a basic algorithm that calculates whether it's
possible to take the requested number of Datanodes offline. The expectation is
that a decommission or maintenance request should return an error, instead of
looping continuously, if the cluster doesn't have a sufficient number of
Datanodes. While this isn't a problem in large, typical clusters, it will
certainly improve the user experience for POC-style small clusters.
This work is split into separate tasks for decommission, maintenance, and
integration testing.
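To make the arithmetic concrete, here is a minimal sketch of the kind of
up-front check described above. It is not the algorithm from the design doc
and not Ozone code: the names ECReplication, max_nodes_offline and
check_offline_request are hypothetical, and the sketch assumes that only EC
containers matter and that every counted Datanode is healthy and in service.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ECReplication:
    """Hypothetical stand-in for an EC replication config such as EC 10-4."""
    data: int
    parity: int

    @property
    def required_nodes(self) -> int:
        # Assumption: each replica of a container needs its own Datanode.
        return self.data + self.parity


def max_nodes_offline(healthy_nodes: int, configs: Sequence[ECReplication]) -> int:
    """Return how many Datanodes could be taken offline at the same time."""
    # The most demanding replication config in use sets the floor.
    required = max((c.required_nodes for c in configs), default=0)
    return max(healthy_nodes - required, 0)


def check_offline_request(healthy_nodes: int, requested: int,
                          configs: Sequence[ECReplication]) -> None:
    """Fail immediately, instead of retrying, when the request cannot succeed."""
    allowed = max_nodes_offline(healthy_nodes, configs)
    if requested > allowed:
        raise ValueError(
            f"Cannot take {requested} Datanodes offline: "
            f"at most {allowed} can go offline with {healthy_nodes} healthy nodes"
        )


if __name__ == "__main__":
    ec_10_4 = ECReplication(data=10, parity=4)
    check_offline_request(15, 1, [ec_10_4])      # accepted
    try:
        check_offline_request(15, 2, [ec_10_4])  # rejected up front
    except ValueError as err:
        print(err)
```

In the 15 DN example, a request for one Datanode passes and a request for two
is rejected right away rather than being re-checked every 30 seconds. A real
check would also need to account for other replication types and for nodes
that are already offline.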
Attaching a design doc -
[https://docs.google.com/document/d/1DPbvS__I1iIwYtjXVH4zYjrMCPeAIguo4evH-dS5dhA/edit?usp=sharing.]
Reviews and suggestions are welcome!
> Try to fail early when taking a Datanode offline isn't possible because of
> insufficient number of Datanodes in the cluster
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: HDDS-10461
> URL: https://issues.apache.org/jira/browse/HDDS-10461
> Project: Apache Ozone
> Issue Type: Epic
> Components: SCM
> Reporter: Siddhant Sangwan
> Assignee: Tejaskriya Madhan
> Priority: Major