[ 
https://issues.apache.org/jira/browse/HDDS-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291590#comment-17291590
 ] 

Stephen O'Donnell commented on HDDS-1880:
-----------------------------------------

[~ferhui] Thanks for the question.

Putting a node into maintenance may not be "free". Some replication may be 
needed to ensure the containers are still minimally replicated (2 copies by 
default, but this can be set to 1), so putting a lot of nodes into 
maintenance at once could be expensive too.

What I would really like to see is that an operator does not need to worry 
about the number of nodes they take out of service at the same time. For 
example, if you have a 100 node cluster and want to decommission 50 nodes, and 
assuming:

1. The load on the cluster is such that 50 nodes can handle it
2. There is less than 50% space used on the cluster

You should just decommission all 50 nodes at once and the system should take 
care to throttle the replication so it remains functional. In some respects, 
Ozone should handle this better than HDFS, as it is replicating at the 
container level. The total number of containers on the cluster should be much 
lower than the number of blocks on an HDFS cluster, so there are fewer 
replication jobs to schedule and keep track of.
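
As a rough illustration, the two assumptions in the example above amount to a 
simple capacity check. This is only a sketch: the function name and parameters 
are hypothetical, and it assumes equally sized, evenly loaded nodes, which a 
real cluster may not have.

```python
def can_decommission(total_nodes, nodes_to_remove, used_fraction, load_fraction):
    """Return True if the remaining nodes can absorb the data and the load.

    Hypothetical helper, not part of Ozone. used_fraction and load_fraction
    are fractions (0.0 to 1.0) of the whole cluster's space and throughput.
    Assumes all nodes are the same size and evenly loaded.
    """
    if not 0 <= nodes_to_remove < total_nodes:
        raise ValueError("nodes_to_remove must be in [0, total_nodes)")
    # Fraction of the cluster that stays in service.
    remaining_fraction = (total_nodes - nodes_to_remove) / total_nodes
    # Space used must be strictly less than what remains; load must fit.
    return used_fraction < remaining_fraction and load_fraction <= remaining_fraction
```

For the 100 node example, `can_decommission(100, 50, 0.4, 0.5)` passes the 
check, while a cluster that is 60% full would not.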

However, the current replication manager implementation in Ozone will not 
handle the above scenario well. It will schedule all the replication jobs out 
to all the DNs on the first pass. While the tasks are queued, they can time 
out and get rescheduled, etc.
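
To make the throttling idea concrete, here is a minimal sketch of a scheduler 
that caps the number of in-flight replication tasks per datanode instead of 
sending everything out on the first pass. It is not the actual replication 
manager code; the function, the task tuples and the `max_in_flight_per_dn` 
limit are all assumptions made for illustration.

```python
from collections import deque


def schedule_throttled(pending, in_flight, max_in_flight_per_dn):
    """Dispatch pending tasks without exceeding a per-datanode limit.

    Hypothetical sketch, not Ozone's API.
    pending:   deque of (container_id, target_dn) tuples, oldest first
    in_flight: dict mapping target_dn to its count of running tasks
    Returns the tasks dispatched on this pass; the rest stay in `pending`
    (in their original order) to be retried on a later pass.
    """
    dispatched = []
    deferred = deque()
    while pending:
        container_id, dn = pending.popleft()
        if in_flight.get(dn, 0) < max_in_flight_per_dn:
            # The datanode has spare capacity, so send the task now.
            in_flight[dn] = in_flight.get(dn, 0) + 1
            dispatched.append((container_id, dn))
        else:
            # The datanode is busy; hold the task back rather than queue it
            # on the datanode, where it could time out while waiting.
            deferred.append((container_id, dn))
    pending.extend(deferred)
    return dispatched
```

The caller would decrement `in_flight[dn]` when a task completes and call the 
function again, so work is only handed to a datanode when it can start on it.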

If our goal is to be able to decommission a large part of the cluster at once, 
then I think we need some work in the replication manager, but it should be 
possible to achieve this safely without decommissioning in small batches.

> Decommissioning and maintenance mode in Ozone 
> ----------------------------------------------
>
>                 Key: HDDS-1880
>                 URL: https://issues.apache.org/jira/browse/HDDS-1880
>             Project: Apache Ozone
>          Issue Type: New Feature
>          Components: SCM
>            Reporter: Marton Elek
>            Assignee: Stephen O'Donnell
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Decommission and Maintenance Overview.pdf
>
>
> This is the umbrella jira for decommissioning support in Ozone. Design doc 
> will be attached soon.


