[ 
https://issues.apache.org/jira/browse/HDDS-1881?focusedWorklogId=301844&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301844
 ]

ASF GitHub Bot logged work on HDDS-1881:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 27/Aug/19 09:53
            Start Date: 27/Aug/19 09:53
    Worklog Time Spent: 10m 
      Work Description: hadoop-yetus commented on pull request #1196: 
HDDS-1881. Design doc: decommissioning in Ozone
URL: https://github.com/apache/hadoop/pull/1196#discussion_r317993068
 
 

 ##########
 File path: hadoop-hdds/docs/content/design/decommissioning.md
 ##########
 @@ -0,0 +1,610 @@
+---
+title: Decommissioning in Ozone
+summary: Formal process to shut down machines in a safe way after the required replications.
+date: 2019-07-31
+jira: HDDS-1881
+status: current
+author: Anu Engineer, Marton Elek, Stephen O'Donnell
+---
+
+# Abstract
+
+The goal of decommissioning is to turn off a selected set of machines without data loss. It may or may not require moving the existing replicas of the containers to other nodes.
+
+There are two main classes of decommissioning:
+
+ * __Maintenance mode__: where the node is expected to be back after a while. It may not require replication of containers if enough replicas are available from other nodes (as we expect to have the current replicas back after the restart).
+
+ * __Decommissioning__: where the node won't be started again. All the data 
should be replicated according to the current replication rules.
+
+Goals:
+
+ * Decommissioning can be canceled at any time
+ * The progress of the decommissioning should be trackable
+ * The nodes under decommissioning / maintenance mode should not be used for new pipelines / containers
+ * The state of the datanodes should be persisted / replicated by the SCM (in HDFS the decommissioning info exclude/include lists are replicated manually by the admin). If a datanode is marked for decommissioning, this state should be available after SCM and/or Datanode restarts.
+ * We need to support validations before decommissioning (but the violations can be ignored by the admin).
+ * The administrator should be notified when a node can be turned off.
+ * The maintenance mode can be time constrained: if the node is marked for maintenance for 1 week and the node is not up after one week, the containers should be considered lost (DEAD node) and should be replicated.
+
+# Introduction
+
+Ozone is a highly available file system that relies on commodity hardware. In 
other words, Ozone is designed to handle failures of these nodes all the time.
+
+The Storage Container Manager(SCM) is designed to monitor the node health and 
replicate blocks and containers as needed.
+
+At times, operators of the cluster can help the SCM by giving it hints. When removing a datanode, the operator can provide a hint that a planned failure of the node is coming up, so that SCM can make sure the cluster reaches a safe state to handle this planned failure.
+
+Sometimes, this failure is transient; that is, the operator is taking down this node temporarily. In that case, we can live with lower replica counts by being optimistic.
+
+Both of these operations, __Maintenance__ and __Decommissioning__, are similar from the replication point of view. In both cases, the user instructs us on how to handle an upcoming failure.
+
+Today, SCM (the *Replication Manager* component inside SCM) understands only one form of failure handling. This paper extends the Replica Manager failure modes to allow users to request which failure handling model should be adopted (Optimistic or Pessimistic).
+
+Based on physical realities, there are two responses to any perceived failure: heal the system by taking corrective actions, or ignore the failure, since future actions will heal the system automatically.
+
+## User Experiences (Decommissioning vs Maintenance mode)
+
+From the user's point of view, there are two kinds of planned failures that 
the user would like to communicate to Ozone.
+
+The first kind is when a 'real' failure is going to happen in the future. This 'real' failure is the act of decommissioning. We denote this as "decommission" throughout this paper. The response that the user wants is for SCM/Ozone to make replicas to deal with the planned failure.
+
+The second kind is when the failure is 'transient.' The user knows that this failure is temporary, and the cluster in most cases can safely ignore this issue. However, if the transient failures are going to cause a loss of availability, then the user would like Ozone to take appropriate actions to address it. An example of this case is if the user puts 3 data nodes into maintenance mode and switches them off.
+
+The transient failure can violate the availability guarantees of Ozone, since the user is telling us not to take corrective actions. Many times, the user does not understand the impact on availability when asking Ozone to ignore the failure.
+
+So this paper proposes the following definitions for Decommission and 
Maintenance of data nodes.
+
+__Decommission__ *of a data node is deemed to be complete when SCM/Ozone completes the replication of all containers on the decommissioned data node to other data nodes. That is, the expected count matches the healthy count of containers in the cluster*.
+
+__Maintenance mode__ of a data node is complete if Ozone can guarantee at 
*least one copy of every container* is available in other healthy data nodes.
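+
+As a minimal sketch, these two completion conditions could be expressed as predicates over per-container counters. Everything below is illustrative; the record and method names are hypothetical, not the actual SCM API:
+
+```java
+import java.util.List;
+
+class CompletionChecks {
+  // Hypothetical per-container counters; not an actual SCM class.
+  record ContainerCounts(int healthy, int expected) {}
+
+  // Decommission complete: every container is fully re-replicated elsewhere.
+  static boolean decommissionComplete(List<ContainerCounts> containers) {
+    return containers.stream().allMatch(c -> c.healthy() >= c.expected());
+  }
+
+  // Maintenance complete: at least one healthy copy of every container
+  // is available on other data nodes.
+  static boolean maintenanceComplete(List<ContainerCounts> containers) {
+    return containers.stream().allMatch(c -> c.healthy() >= 1);
+  }
+}
+```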
+
+## Examples
+
+Here are some illustrative examples:
+
+1. Let us say we have a container which has only one copy, residing on Machine A. If the user wants to put machine A into maintenance mode, Ozone will make a replica before entering the maintenance mode.
+
+2. Suppose a container has two copies, and the user wants to put Machine A into maintenance mode. In this case, Ozone understands that the availability of the container is not affected and hence can decide to forgo replication.
+
+3. Suppose a container has two copies, and the user wants to put Machine A 
into maintenance mode. However, the user wants to put the machine into 
maintenance mode for one month. As the period of maintenance mode increases, 
the probability of data loss increases; hence, Ozone might choose to make a 
replica of the container even if we are entering maintenance mode.
+
+4. The semantics of decommissioning mean that as long as we can find copies of the containers on other machines, we can technically get away with calling the decommission complete. Hence this clarification note: in the ordinary course of action, each decommission will create a replication flow for each container we have; however, it is possible to complete a decommission of a data node even if the data node being decommissioned fails. As long as we can find other datanodes to replicate from and get the number of replicas back up to the expected count, we are good.
+
+5. Let us say we have a copy of a container replica on Machines A, B, and C. It is possible to decommission all three machines at the same time, as decommissioning is just a status indicator of the data node until we finish the decommissioning process.
+
+
+The user-visible features for both of these operations are very similar:
+
+Both Decommission and Maintenance mode can be canceled any time before the 
operation is marked as completed by SCM.
+
+Decommissioned nodes, if and when added back, shall be treated as new data 
nodes; if they have blocks or containers on them, they can be used to 
reconstruct data.
+
+
+## Maintenance mode in HDFS
+
+HDFS supports decommissioning and maintenance mode similar to Ozone. This is a 
quick description of the HDFS approach.
+
+The usage of HDFS maintenance mode:
+
+  * First, you set a minimum replica count on the cluster, which can be zero, 
but defaults to 1.
+  * Then you can put a number of nodes into maintenance, with an expiry time, or have them remain in maintenance forever until they are manually removed (see the example after this list). Nodes are put into maintenance in much the same way as nodes are decommissioned.
+  * When a set of nodes goes into maintenance, all blocks hosted on them are scanned, and if a node going into maintenance would cause the number of replicas to fall below the minimum replica count, the relevant nodes go into a decommissioning-like state while new replicas are made for the blocks.
+  * Once the node goes into maintenance, it can be stopped, etc., and HDFS will not be concerned about the under-replicated state of the blocks.
+  * When the expiry time passes, the node is put back into the normal state (if it is online and heartbeating) or marked as dead, at which time new replicas will start to be made.
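+
+For illustration, HDFS drives the maintenance state from its JSON-based hosts file (read by the `CombinedHostFileManager`). A sketch of such a file, with made-up host names and timestamps:
+
+```json
+[
+  {"hostName": "dn1.example.com", "adminState": "IN_MAINTENANCE",
+   "maintenanceExpireTimeInMS": 1567296000000},
+  {"hostName": "dn2.example.com", "adminState": "NORMAL"}
+]
+```
+
+This requires `dfs.namenode.hosts.provider.classname` to be set to `org.apache.hadoop.hdfs.server.blockmanagement.CombinedHostFileManager` and `dfs.hosts` to point at the JSON file; the minimum replica count mentioned above is `dfs.namenode.maintenance.replication.min`.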
+
+This is very similar to decommissioning, and the code to track maintenance mode and ensure the blocks are replicated, etc., is effectively the same code as for decommissioning. The one area that differs is probably the replication monitor, as it must understand that the node is expected to be offline.
+
+The ideal way to use maintenance mode is when you know there is a set of nodes you can stop without having to do any replications. In HDFS, the rack awareness policy states that all blocks should be on two racks, so that means a rack can be put into maintenance safely.
+
+There is another feature in HDFS called "Upgrade Domain", which allows each datanode to be assigned a group. By default there should be at least 3 groups (domains), and each of the 3 replicas will be stored in a different group, allowing one full group to be put into maintenance at once. That is not yet supported in CDH, but is something we are targeting for CDPD, I believe.
+
+One other difference between maintenance mode and decommissioning is that you must have some sort of monitor thread checking for when maintenance is scheduled to end. HDFS solves this by having a class called the DatanodeAdminManager, which tracks all nodes transitioning state, the under-replicated block count on them, etc.
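+
+A minimal sketch of such an expiry monitor is below. The class and method names are hypothetical; the actual DatanodeAdminManager internals differ:
+
+```java
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+
+// Illustrative maintenance-expiry monitor, not the real DatanodeAdminManager.
+class MaintenanceExpiryMonitor implements Runnable {
+  // node id -> maintenance expiry time (epoch millis)
+  private final Map<String, Long> expiryByNode = new ConcurrentHashMap<>();
+
+  void schedule(String nodeId, long expiryMillis) {
+    expiryByNode.put(nodeId, expiryMillis);
+  }
+
+  @Override
+  public void run() {
+    long now = System.currentTimeMillis();
+    expiryByNode.entrySet().removeIf(e -> {
+      if (now >= e.getValue()) {
+        // If the node is back and heartbeating, return it to the normal
+        // state; otherwise mark it dead so re-replication starts.
+        endMaintenance(e.getKey());
+        return true;
+      }
+      return false;
+    });
+  }
+
+  private void endMaintenance(String nodeId) {
+    // In a real system, the transition logic lives in the admin/state manager.
+  }
+}
+```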
+
+
+# Implementation
+
+
+## Datanode state machine
+
+`NodeStateManager` maintains the state of the connected datanodes. The possible states:
+
+  state                | description
+  ---------------------|------------
+  HEALTHY              | The node is up and running.
+  STALE                | Some heartbeats have been missed from a previously healthy node.
+  DEAD                 | The stale node has not recovered.
+  ENTERING_MAINTENANCE | In-progress state: scheduling is disabled, but the node can't be turned off yet due to in-progress replication.
+  IN_MAINTENANCE       | The node can be turned off, but we expect to get it back with all its replicas.
+  DECOMMISSIONING      | In-progress state: scheduling is disabled, and all the containers should be replicated to other nodes.
+  DECOMMISSIONED       | The node can be turned off; all the containers are replicated to other machines.
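+
+As a sketch, these states could be captured in an enum like the following (illustrative; the actual HDDS types and names may differ):
+
+```java
+// Illustrative enum mirroring the table above.
+enum DatanodeAdminState {
+  HEALTHY,               // up and running
+  STALE,                 // heartbeats recently missed
+  DEAD,                  // the stale node did not recover
+  ENTERING_MAINTENANCE,  // replication in progress before maintenance
+  IN_MAINTENANCE,        // node may be turned off; replicas expected back
+  DECOMMISSIONING,       // replicating all containers off the node
+  DECOMMISSIONED;        // safe to turn off permanently
+
+  // In-progress states where scheduling is disabled but work remains.
+  boolean isTransitional() {
+    return this == ENTERING_MAINTENANCE || this == DECOMMISSIONING;
+  }
+}
+```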
+
+
+
+## High level algorithm
+
+The algorithm is pretty simple from the Decommission or Maintenance point of view:
+
+ 1. Mark a data node as DECOMMISSIONING or ENTERING_MAINTENANCE. This implies that the node is NOT healthy anymore; we assume the use of a single flag and the law of the excluded middle.
+
+ 2. Pipelines should be shut down, and we wait for confirmation that all pipelines are shut down, so that no new I/O or container creation can happen on a Datanode that is part of decommission/maintenance.
+
+ 3. Once the Node has been marked as DECOMMISSIONING or ENTERING_MAINTENANCE, the Node will generate a list of containers that need replication. This list is generated from the Replica Count decisions for each container; the Replica Count will be computed by the Replica Manager.
+
+ 4. Replica Manager will check the stop condition for each node. The following 
should be true for all the containers to go from DECOMMISSIONING to 
DECOMMISSIONED or from ENTERING_MAINTENANCE to IN_MAINTENANCE.
+
+   * Container is closed.
+   * We have at least one HEALTHY copy at all times.
+   * For entering DECOMMISSIONED mode, `maintenance + healthy` must equal `expectedCount`.
+
+ 5. We will update the node state to DECOMMISSIONED or IN_MAINTENANCE when the stop condition is reached.
+
+_Replica count_ is a calculated number which represents the number of _missing_ replicas. The number can be negative in the case of an over-replicated container.
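+
+A minimal sketch of these checks under the definitions above (the counter names `healthy` and `maintenance` follow the grouping in the next section; the record and its methods are illustrative, not the actual Replica Manager API):
+
+```java
+// Hypothetical per-container view used to evaluate the stop conditions.
+record ContainerReplicaState(boolean closed, int healthy, int maintenance,
+                             int expectedCount) {
+
+  // One plausible definition of the replica count: the number of missing
+  // replicas; negative if the container is over-replicated.
+  int replicaCount() {
+    return expectedCount - healthy;
+  }
+
+  // Stop condition for ENTERING_MAINTENANCE -> IN_MAINTENANCE.
+  boolean canEnterMaintenance() {
+    return closed && healthy >= 1;
+  }
+
+  // Stop condition for DECOMMISSIONING -> DECOMMISSIONED.
+  boolean canCompleteDecommission() {
+    return closed && healthy >= 1 && maintenance + healthy == expectedCount;
+  }
+}
+```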
+
+## Calculation of the _Replica count_ (required replicas)
+
+### Counters / Variables
+
+We have 7 different datanode states, but some of them can be combined. At a high level we can group the existing states into three groups:
+
+ * healthy state (when the container is available)
+ * maintenance (including IN_MAINTENANCE and ENTERING_MAINTENANCE)
+ * all the others.
+
+To calculate the required steps (required replication + stop condition) we need counters for the first two categories.
+
+Each counter should be calculated on a per-container basis.
+
+   Node state                    | Variable (# of containers) |
+   ------------------------------|----------------------------|
+   HEALTHY                       | `healthy`                  |
+   STALE + DEAD + DECOMMISSIONED |                            |
 
 Review comment:
   whitespace:tabs in line
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 301844)
    Time Spent: 42h 10m  (was: 42h)

> Design doc: decommissioning in Ozone
> ------------------------------------
>
>                 Key: HDDS-1881
>                 URL: https://issues.apache.org/jira/browse/HDDS-1881
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>            Priority: Major
>              Labels: design, pull-request-available
>          Time Spent: 42h 10m
>  Remaining Estimate: 0h
>
> Design doc can be attached to the documentation. In this jira the design doc 
> will be attached and merged to the documentation page.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
