ed and if the node going into maintenance would cause the number of replicas to fall below the minimum replica count, the relevant nodes go into a decommissioning-like state while new replicas are made for the blocks.
+ * Once the node goes into maintenance, it can be stopped, etc., and HDFS will not be concerned about the under-replicated state of the blocks.
+ * When the expiry time passes, the node is put back into the normal state (if it is online and heartbeating) or marked as dead, at which time new replicas will start to be made.
+
+This is very similar to decommissioning, and the code that tracks maintenance mode and ensures the blocks are replicated, etc., is effectively the same as the decommissioning code. The one area that differs is probably the replication monitor, as it must understand that the node is expected to be offline.
+
+The ideal way to use maintenance mode is when you know there is a set of nodes you can stop without having to do any replication. In HDFS, the rack awareness policy states that all blocks should be on two racks, which means a whole rack can be put into maintenance safely.
+
+There is another feature in HDFS called "upgrade domain" which allows each datanode to be assigned to a group. By default there should be at least 3 groups (domains), and each of the 3 replicas will be stored in a different group, allowing one full group to be put into maintenance at once. That is not yet supported in CDH, but is something we are targeting for CDPD, I believe.
+
+One other difference between maintenance mode and decommissioning is that you must have some sort of monitor thread checking for when maintenance is scheduled to end. HDFS solves this with a class called the DatanodeAdminManager, which tracks all nodes transitioning state, the under-replicated block count on them, etc.
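+
+A minimal sketch of such an expiry monitor is shown below. All class and method names here are illustrative assumptions (this is not the actual HDFS `DatanodeAdminManager`, nor a final Ozone API); the point is only that some periodic task has to notice that a maintenance window has passed and hand the node back to the normal replication logic.
+
+```java
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+
+/** Illustrative sketch of a maintenance-expiry monitor thread. */
+public class MaintenanceExpiryMonitor {
+
+  /** Datanode id -> maintenance expiry time (epoch millis). */
+  private final Map<String, Long> inMaintenance = new ConcurrentHashMap<>();
+
+  private final ScheduledExecutorService scheduler =
+      Executors.newSingleThreadScheduledExecutor();
+
+  public void start() {
+    // Periodically check whether any maintenance window has expired.
+    scheduler.scheduleAtFixedRate(this::checkExpiry, 30, 30, TimeUnit.SECONDS);
+  }
+
+  public void enterMaintenance(String datanodeId, long expiryEpochMillis) {
+    inMaintenance.put(datanodeId, expiryEpochMillis);
+  }
+
+  private void checkExpiry() {
+    long now = System.currentTimeMillis();
+    inMaintenance.forEach((datanodeId, expiry) -> {
+      if (expiry <= now) {
+        inMaintenance.remove(datanodeId);
+        // Put the node back to the normal state if it is heartbeating,
+        // or mark it dead so that re-replication can start.
+        onMaintenanceExpired(datanodeId);
+      }
+    });
+  }
+
+  protected void onMaintenanceExpired(String datanodeId) {
+    // Hook for the node manager / replication logic.
+  }
+}
+```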
+
+
+# Implementation
+
+
+## Datanode state machine
+
+`NodeStateManager` maintains the state of the connected datanodes. The
possible states:
+
+ state             | description
+ ------------------|------------
+ HEALTHY           | The node is up and running.
+ STALE             | Some heartbeats have been missed from a previously healthy node.
+ DEAD              | A stale node that has not recovered.
+ ENTER_MAINTENANCE | The in-progress state: scheduling is disabled, but the node cannot yet be turned off due to in-progress replication.
+ IN_MAINTENANCE    | The node can be turned off, but it is expected to come back with all of its replicas.
+ DECOMMISSIONING   | The in-progress state: scheduling is disabled and all the containers should be replicated to other nodes.
+ DECOMMISSIONED    | The node can be turned off; all the containers have been replicated to other machines.
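+
+As a rough illustration only, the states in the table above could be modelled as a single Java enum. The name `DatanodeAdminState` and the single-enum layout are assumptions of this sketch; the real `NodeStateManager` may model health and operational state separately.
+
+```java
+/**
+ * Sketch only: the datanode states from the table above as one enum.
+ * The actual NodeStateManager implementation may model them differently.
+ */
+public enum DatanodeAdminState {
+  HEALTHY,            // up and running
+  STALE,              // some heartbeats missed from a previously healthy node
+  DEAD,               // a stale node that has not recovered
+  ENTER_MAINTENANCE,  // scheduling disabled, waiting for in-progress replication
+  IN_MAINTENANCE,     // may be turned off, expected back with all replicas
+  DECOMMISSIONING,    // scheduling disabled, containers being replicated away
+  DECOMMISSIONED      // may be turned off, all containers replicated elsewhere
+}
+```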
+
+
+
+## High level algorithm
+
+The algorithm is pretty simple from the decommission or maintenance point of view (a rough code sketch follows the steps below):
+
+ 1. Mark a datanode as DECOMMISSIONING or ENTERING_MAINTENANCE. This implies that the node is NOT healthy anymore; we assume the use of a single flag and the law of the excluded middle.
+
+ 2. Pipelines should be shut down, and we wait for confirmation that all pipelines are shut down, so no new I/O or container creation can happen on a datanode that is part of decommissioning or maintenance.
+
+ 3. Once the node has been marked as DECOMMISSIONING or ENTERING_MAINTENANCE, the node will generate a list of containers that need replication. This list is generated from the replica count decision for each container; the replica count is computed by the Replica Manager.
+
+ 4. Once the replica count for one of these containers goes back to zero, which means its pending replications have finished, that container is removed from the wait list.
+
+ 5. Once the size of the wait list reaches zero, maintenance mode or decommission is complete.
+
+ 6. We will update the node state to DECOMMISSIONED or IN_MAINTENANCE.
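+
+A rough, hypothetical sketch of this flow in Java is shown below. The interfaces (`NodeManager`, `PipelineManager`, `ReplicaManager` as used here), the method names, and the polling approach are illustrative assumptions of this sketch, not the actual SCM APIs.
+
+```java
+import java.util.HashSet;
+import java.util.Set;
+import java.util.concurrent.TimeUnit;
+
+/** Illustrative sketch of the decommission / maintenance flow above. */
+public class AdminFlowSketch {
+
+  enum NodeState { DECOMMISSIONING, DECOMMISSIONED, ENTERING_MAINTENANCE, IN_MAINTENANCE }
+
+  interface NodeManager { void setState(String datanodeId, NodeState state); }
+
+  interface PipelineManager {
+    void closePipelines(String datanodeId);
+    boolean allPipelinesClosed(String datanodeId);
+  }
+
+  interface ReplicaManager {
+    Set<Long> containersNeedingReplication(String datanodeId); // the "wait list"
+    int replicaCount(long containerId);                        // missing replicas
+  }
+
+  private final NodeManager nodes;
+  private final PipelineManager pipelines;
+  private final ReplicaManager replicas;
+
+  AdminFlowSketch(NodeManager nodes, PipelineManager pipelines, ReplicaManager replicas) {
+    this.nodes = nodes;
+    this.pipelines = pipelines;
+    this.replicas = replicas;
+  }
+
+  void decommissionOrMaintenance(String datanodeId, boolean maintenance)
+      throws InterruptedException {
+    // 1. Mark the node: it is not considered healthy for scheduling anymore.
+    nodes.setState(datanodeId,
+        maintenance ? NodeState.ENTERING_MAINTENANCE : NodeState.DECOMMISSIONING);
+
+    // 2. Close the pipelines and wait for confirmation, so no new I/O or
+    //    container creation can happen on this node.
+    pipelines.closePipelines(datanodeId);
+    while (!pipelines.allPipelinesClosed(datanodeId)) {
+      TimeUnit.SECONDS.sleep(1);
+    }
+
+    // 3. Build the wait list of containers that still need replication.
+    Set<Long> waitList =
+        new HashSet<>(replicas.containersNeedingReplication(datanodeId));
+
+    // 4./5. Drop containers whose replica count went back to zero (or below,
+    //       i.e. over-replicated); when the list is empty we are done.
+    while (!waitList.isEmpty()) {
+      waitList.removeIf(id -> replicas.replicaCount(id) <= 0);
+      TimeUnit.SECONDS.sleep(1);
+    }
+
+    // 6. Reached the terminal state.
+    nodes.setState(datanodeId,
+        maintenance ? NodeState.IN_MAINTENANCE : NodeState.DECOMMISSIONED);
+  }
+}
+```
+
+The key property is the ordering: the node only reaches its terminal state after the wait list has drained.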
+
+_Replica count_ is a calculated number which represents the number of
_missing_ replicas. The number can be negative in case of an over-replicated
container.
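+
+As a minimal illustration of this definition (the class and method names are ours, not the Replica Manager API, and in-flight operations are ignored here; the counters in the next section refine this):
+
+```java
+final class ReplicaCountExample {
+  /**
+   * Illustration only: the replica count is the number of missing replicas.
+   * A negative value means the container is over-replicated.
+   */
+  static int replicaCount(int expectedReplicaCount, int actualReplicaCount) {
+    return expectedReplicaCount - actualReplicaCount; // e.g. expected 3, actual 4 -> -1
+  }
+}
+```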
+
+
+## Calculation of the _Replica count_ (required replicas)
+
+### Counters / Variables
+
+We have 7 different datanode states and three different types of container replica state (replicated, in-flight deletion, or in-flight replication). To calculate the required replicas we should introduce a few variables.
+
+Note: we don't need to use all the possible counters, but the following table summarizes how the counters are calculated for the algorithm below.
+
+For example, the `maintenance` variable counts the existing replicas on ENTERING_MAINTENANCE or IN_MAINTENANCE nodes.
+
+Each counter should be calculated on a per-container basis.
+
+ Node state                     | Containers - in-flight deletion | In-Flight   |
+ -------------------------------|---------------------------------|-------------|
+ HEALTHY                        | `healthy`                       | `inFlight`  |
+ STALE + DEAD + DECOMMISSIONED  |                                 |             |
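+
+To make the counters concrete, below is a sketch of how they could be collected per container. The `Replica` type, the `inFlightReplications` parameter, and the treatment of any states not listed in the table are assumptions of this illustration, not the actual Replica Manager implementation.
+
+```java
+import java.util.Collection;
+
+/** Illustrative sketch of collecting the per-container counters above. */
+final class ReplicaCounters {
+
+  enum NodeState {
+    HEALTHY, STALE, DEAD, ENTER_MAINTENANCE, IN_MAINTENANCE,
+    DECOMMISSIONING, DECOMMISSIONED
+  }
+
+  /** A replica as seen by SCM: where it lives and whether a delete is pending. */
+  static final class Replica {
+    final NodeState nodeState;
+    final boolean inFlightDeletion;
+
+    Replica(NodeState nodeState, boolean inFlightDeletion) {
+      this.nodeState = nodeState;
+      this.inFlightDeletion = inFlightDeletion;
+    }
+  }
+
+  int healthy;      // replicas on HEALTHY nodes, minus in-flight deletions
+  int maintenance;  // replicas on ENTER_MAINTENANCE / IN_MAINTENANCE nodes
+  int inFlight;     // in-flight (pending) replications for this container
+
+  static ReplicaCounters countFor(Collection<Replica> replicas,
+      int inFlightReplications) {
+    ReplicaCounters counters = new ReplicaCounters();
+    for (Replica replica : replicas) {
+      switch (replica.nodeState) {
+        case HEALTHY:
+          if (!replica.inFlightDeletion) {
+            counters.healthy++;
+          }
+          break;
+        case ENTER_MAINTENANCE:
+        case IN_MAINTENANCE:
+          counters.maintenance++;
+          break;
+        default:
+          // Replicas on STALE, DEAD or DECOMMISSIONED nodes are not counted
+          // (empty cells in the table); other states are ignored in this sketch.
+          break;
+      }
+    }
+    counters.inFlight = inFlightReplications;
+    return counters;
+  }
+}
+```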