Re: [PR] HDDS-10666. Replication Manager 2: Architecture and concepts [ozone-site]

via GitHub Thu, 11 Apr 2024 16:16:01 -0700


errose28 commented on code in PR #88:
URL: https://github.com/apache/ozone-site/pull/88#discussion_r1561797665



##########
docs/03-core-concepts/02-replication/05-replication-manager.md:
##########
@@ -0,0 +1,316 @@
+# Replication Manager
+
+Replication Manager (RM) is a thread which runs inside the leader SCM daemon 
in an Ozone cluster. Its role is to periodically check the health of all the 
containers in the cluster, and take action for any containers which are not 
healthy. Often that action involves arranging for new replicas of the container 
to be created, but it can also involve closing the replicas, deleting empty 
replicas and so on.
+
+## Architecture
+
+The RM process is split into two stages: one checks containers and identifies those with problems, and the other processes the problem containers.
+
+### Container Check Stage
+
+The check stage runs periodically, every 5 minutes. It first gathers a list of all containers in the cluster, then passes each container through a chain of rules to identify any problems. The rules are arranged similarly to the “Chain of Responsibility” design pattern: the first rule that “matches” causes the chain to exit. Each type of check is implemented in a standalone Java class, and the checks are processed in a defined order:
+
+```java
+containerCheckChain = new OpenContainerHandler(this);
+containerCheckChain
+   .addNext(new ClosingContainerHandler(this, clock))
+   .addNext(new QuasiClosedContainerHandler(this))
+   .addNext(new MismatchedReplicasHandler(this))
+   .addNext(new EmptyContainerHandler(this))
+   .addNext(new DeletingContainerHandler(this))
+   .addNext(ecReplicationCheckHandler)
+   .addNext(ratisReplicationCheckHandler)
+   .addNext(new ClosedWithUnhealthyReplicasHandler(this))
+   .addNext(ecMisReplicationCheckHandler)
+   .addNext(new RatisUnhealthyReplicationCheckHandler())
+   .addNext(new VulnerableUnhealthyReplicasHandler(this));
+```
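The chaining behaviour can be illustrated with a stripped-down, self-contained sketch. The classes below (`CheckHandler`, `SimpleContainer` and the three rules) are hypothetical stand-ins, not the Ozone handlers listed above; they only show how `addNext` links handlers into a chain and how the first rule that matches stops it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified container description used only for this sketch.
final class SimpleContainer {
  final long id;
  final String state;
  final int replicas;
  final int expectedReplicas;

  SimpleContainer(long id, String state, int replicas, int expectedReplicas) {
    this.id = id;
    this.state = state;
    this.replicas = replicas;
    this.expectedReplicas = expectedReplicas;
  }
}

// One rule in the chain. A rule either claims the container (returns true,
// which ends the chain) or lets the next rule look at it.
abstract class CheckHandler {
  private CheckHandler next;

  // Links a handler after this one and returns it, so calls can be chained:
  // first.addNext(second).addNext(third) builds first -> second -> third.
  CheckHandler addNext(CheckHandler handler) {
    this.next = handler;
    return handler;
  }

  protected abstract boolean handle(SimpleContainer c, List<String> report);

  // Walks the chain until some rule matches; returns false if none did.
  final boolean handleChain(SimpleContainer c, List<String> report) {
    if (handle(c, report)) {
      return true;
    }
    return next != null && next.handleChain(c, report);
  }
}

// In this sketch, open containers are claimed here and skipped by later rules.
class OpenContainerRule extends CheckHandler {
  @Override
  protected boolean handle(SimpleContainer c, List<String> report) {
    return c.state.equals("OPEN");
  }
}

class UnderReplicatedRule extends CheckHandler {
  @Override
  protected boolean handle(SimpleContainer c, List<String> report) {
    if (c.replicas < c.expectedReplicas) {
      report.add("UNDER_REPLICATED #" + c.id);
      return true;
    }
    return false;
  }
}

class OverReplicatedRule extends CheckHandler {
  @Override
  protected boolean handle(SimpleContainer c, List<String> report) {
    if (c.replicas > c.expectedReplicas) {
      report.add("OVER_REPLICATED #" + c.id);
      return true;
    }
    return false;
  }
}
```

With this chain, an open container with too few replicas is claimed by `OpenContainerRule` and never reaches `UnderReplicatedRule`, which is why the order in which handlers are added matters.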
+
+## Replication Manager Report
+
+Each time the check stage of the Replication Manager runs, it generates a report, which can be viewed with the `ozone admin container report` command. The report provides details of all containers in the cluster that have an issue needing correction.
+
+The output of the command looks as follows:
+
+```text
+# ozone admin container report
+Container Summary Report generated at 2023-08-14T13:01:43Z
+==========================================================
+
+
+Container State Summary
+=======================
+OPEN: 10
+CLOSING: 0
+QUASI_CLOSED: 4
+CLOSED: 95
+DELETING: 0
+DELETED: 298
+RECOVERING: 0
+
+
+Container Health Summary
+========================
+UNDER_REPLICATED: 69
+MIS_REPLICATED: 0
+OVER_REPLICATED: 0
+MISSING: 11
+UNHEALTHY: 0
+EMPTY: 4
+OPEN_UNHEALTHY: 0
+QUASI_CLOSED_STUCK: 0
+
+
+First 100 UNDER_REPLICATED containers:
+#1001, #2003, #4001, #4002, #4004, #4005, #5002, #6006, #6009, #7003, #7006, #9004, #9006, #10002, #11001
+
+
+First 100 MISSING containers:
+#6010, #7004, #7005, #24002, #24003, #24004, #28001, #31001, #34003, #54001, #61003
+
+
+First 100 EMPTY containers:
+#52005, #54003, #55001, #55002
+```
+
+### Container State Summary
+
+The first section of the report, “Container State Summary”, summarizes the state of all containers in the cluster. Containers move through these states as they are opened, filled with block data, and have their data removed over time. A container is in exactly one of these states at any time, so the sum of the counts in this section should equal the number of containers in the cluster. Each state is described in the following sections.
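As a quick consistency check, the state counts in a saved report can be summed and compared with the cluster's container count. The sketch below assumes the plain-text layout of the sample output above (a section title, an `=` underline, `NAME: count` lines, then a blank line); `ReportSummary` is a hypothetical helper, and the report text should not be treated as a stable, machine-readable interface. For the sample report, the seven state counts add up to 407.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that reads one section of the text report.
final class ReportSummary {
  private ReportSummary() { }

  // Collects the "NAME: count" lines that follow a section title and its
  // "====" underline, stopping at the first blank or non-matching line.
  static Map<String, Long> parseSection(String report, String title) {
    Map<String, Long> counts = new LinkedHashMap<>();
    String[] lines = report.split("\\R");
    int i = 0;
    while (i < lines.length && !lines[i].trim().equals(title)) {
      i++;
    }
    // Skip the title line and its underline.
    for (i += 2; i < lines.length; i++) {
      String line = lines[i].trim();
      int colon = line.lastIndexOf(':');
      if (line.isEmpty() || colon < 0) {
        break;
      }
      counts.put(line.substring(0, colon).trim(),
          Long.parseLong(line.substring(colon + 1).trim()));
    }
    return counts;
  }

  // Sum of all counts in a section; for the state summary this should
  // equal the number of containers in the cluster.
  static long total(Map<String, Long> counts) {
    long sum = 0;
    for (long c : counts.values()) {
      sum += c;
    }
    return sum;
  }
}
```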
+
+#### Open
+
+Open containers are available for writes into the cluster.
+
+#### Closing
+
+Closing containers are in the process of being closed. A container transitions to closing when it has enough data to be considered full, or when there is a problem with the write pipeline, such as a Datanode going down.
+
+#### Quasi Closed
+
+A container moves to quasi closed when a Datanode attempts to close a replica but cannot close it cleanly because the Ratis pipeline is unavailable, for example because a Datanode went down unexpectedly. The Replication Manager attempts to close the container by identifying the replica with the highest Block Commit Sequence ID (BCSID) and closing it. Replicas with a lower BCSID are stale, so new copies are made from the closed replica before the stale replicas are removed.
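Choosing which replica of a quasi closed container to close amounts to picking the maximum BCSID. In the hypothetical sketch below, `Replica` holds only a Datanode name and a BCSID; it ignores any other replica details the real Replication Manager may take into account.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical replica description: the Datanode holding it and its BCSID.
final class Replica {
  final String datanode;
  final long bcsid;

  Replica(String datanode, long bcsid) {
    this.datanode = datanode;
    this.bcsid = bcsid;
  }
}

final class QuasiClosedPlanner {
  private QuasiClosedPlanner() { }

  // The replica with the highest BCSID is the one to close.
  static Optional<Replica> replicaToClose(List<Replica> replicas) {
    return replicas.stream().max(Comparator.comparingLong((Replica r) -> r.bcsid));
  }

  // Replicas with a lower BCSID than the chosen one are stale. They should be
  // removed only after new copies have been made from the closed replica.
  // Replicas that tie with the highest BCSID are not considered stale.
  static List<Replica> staleReplicas(List<Replica> replicas) {
    List<Replica> stale = new ArrayList<>();
    Optional<Replica> best = replicaToClose(replicas);
    if (!best.isPresent()) {
      return stale;
    }
    long highest = best.get().bcsid;
    for (Replica r : replicas) {
      if (r.bcsid < highest) {
        stale.add(r);
      }
    }
    return stale;
  }
}
```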
+
+#### Closed
+
+Closed containers have successfully transitioned from closing to closed. This 
is a normal state for containers to move to, and the majority of containers in 
the cluster should be in this state.
+
+#### Deleting
+
+Closed containers whose blocks have all been deleted over time, leaving them empty, transition to deleting. They remain in this state until all replicas have been removed from the Datanodes.
+
+#### Deleted
+
+When the “deleting” process completes and all replicas have been removed, the 
container will move to the deleted state and remain there.
+
+#### Recovering
+
+Recovering is a temporary state, related to EC reconstruction, that container replicas can enter on the Datanodes. Because this state is not managed by the Replication Manager, the report should always show a count of zero for it.
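The container states described above can be summarized as a small state machine. This is a simplified, hypothetical model (`ContainerLifecycle` is not an Ozone class) that covers only the transitions this page mentions; RECOVERING is left out because it is a Datanode replica state that the Replication Manager does not manage.

```java
import java.util.EnumSet;
import java.util.Set;

// Simplified container lifecycle, limited to the transitions described here.
enum ContainerLifecycle {
  OPEN, CLOSING, QUASI_CLOSED, CLOSED, DELETING, DELETED;

  // States this state can move to next.
  Set<ContainerLifecycle> nextStates() {
    switch (this) {
      case OPEN:
        return EnumSet.of(CLOSING);                // full, or pipeline problem
      case CLOSING:
        return EnumSet.of(QUASI_CLOSED, CLOSED);   // clean or unclean close
      case QUASI_CLOSED:
        return EnumSet.of(CLOSED);                 // closed by the RM
      case CLOSED:
        return EnumSet.of(DELETING);               // all blocks deleted
      case DELETING:
        return EnumSet.of(DELETED);                // all replicas removed
      default:
        return EnumSet.noneOf(ContainerLifecycle.class); // DELETED is final
    }
  }

  boolean canMoveTo(ContainerLifecycle target) {
    return nextStates().contains(target);
  }
}
```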
+
+### Container Health Summary
+
+The next section of the report lists the number of containers in each degraded health state. A count of “healthy” containers is not shown. In an otherwise healthy cluster, the Replication Manager should correct containers in any of the listed states, except Missing and Unhealthy, which it cannot repair.
+
+#### Under Replicated
+
+Under-Replicated containers have fewer replicas than expected. This can be caused by Datanodes being decommissioned or entering maintenance mode, or by failed disks or nodes within the cluster. Unhealthy replicas also make a container under-replicated, as they have a problem that must be corrected; see the Unhealthy section below for details of the unhealthy state. The Replication Manager will schedule commands to make additional copies of under-replicated containers.
+
+#### Mis-Replicated
+
+If a container has the correct number of replicas, but they are not spread across enough racks to meet the requirements of the container placement policy, the container is Mis-Replicated. The Replication Manager corrects this by making new copies of the relevant replicas on additional racks and then removing the excess replicas.
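At its core, the mis-replication check compares the number of distinct racks holding replicas with the number of racks the placement policy requires. In the hypothetical sketch below, `requiredRacks` is an assumed input (for example, a policy that wants 3 Ratis replicas on at least 2 racks); in Ozone that requirement comes from the configured placement policy and the cluster topology.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RackCheck {
  private RackCheck() { }

  // replicaRacks holds the rack of each replica, e.g. "/rack1".
  static boolean isMisReplicated(List<String> replicaRacks, int requiredRacks,
                                 int racksInCluster) {
    Set<String> distinctRacks = new HashSet<>(replicaRacks);
    // A container cannot be required to span more racks than the cluster
    // has, or more racks than it has replicas.
    int needed = Math.min(requiredRacks,
        Math.min(racksInCluster, replicaRacks.size()));
    return distinctRacks.size() < needed;
  }
}
```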
+
+#### Over Replicated
+
+Over-Replicated containers have too many replicas. This can occur as a result of correcting mis-replication, or when a decommissioned or down host returns to the cluster after the resulting under-replication has been corrected. The Replication Manager will schedule delete replica commands to remove the excess replicas while maintaining the container placement policy rules around rack placement.
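One simple way to remove excess replicas without reducing rack spread is to always delete from the rack holding the most replicas. The greedy sketch below illustrates that idea; `ExcessReplicaSelector` is hypothetical and is not the Replication Manager's actual selection logic, which may also take replica state and the specific placement policy into account.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ExcessReplicaSelector {
  private ExcessReplicaSelector() { }

  // replicaRacks.get(i) is the rack holding replica i. Returns the indexes
  // of the replicas chosen for deletion.
  static List<Integer> selectForDeletion(List<String> replicaRacks, int expected) {
    // Group replica indexes by rack, keeping first-seen rack order for ties.
    Map<String, List<Integer>> byRack = new LinkedHashMap<>();
    for (int i = 0; i < replicaRacks.size(); i++) {
      byRack.computeIfAbsent(replicaRacks.get(i), k -> new ArrayList<>()).add(i);
    }
    List<Integer> toDelete = new ArrayList<>();
    for (int excess = replicaRacks.size() - expected; excess > 0; excess--) {
      // Delete from the most crowded rack, so a rack only loses its last
      // replica when every rack is down to one.
      List<Integer> busiest = null;
      for (List<Integer> ids : byRack.values()) {
        if (busiest == null || ids.size() > busiest.size()) {
          busiest = ids;
        }
      }
      toDelete.add(busiest.remove(busiest.size() - 1));
    }
    return toDelete;
  }
}
```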
+
+#### Missing
+
+A container is missing if there are not enough replicas available to read it. For a Ratis container, that means zero copies are online. An EC container is marked as missing if fewer than “data number” of replicas are available; for example, an RS-6-3 container with fewer than 6 replicas online is missing. The Replication Manager cannot do anything to correct missing containers. Manual intervention is needed either to bring lost nodes back into the cluster, or to remove the containers from SCM and any related keys from OM, as the data will not be accessible.
+
+#### Unhealthy
+
+A container is unhealthy if it is not missing but there are not enough healthy replicas to allow the container to be read completely.
+
+A replica can be marked as unhealthy by the scanner process on the Datanode for various reasons. For example, the scanner can detect that a block in the container has an invalid checksum and mark the replica unhealthy. A Ratis container is marked as unhealthy if all of its replicas are unhealthy and no healthy replica is available. Note that it may still be possible to read most of the data in an unhealthy container. For Ratis, each replica could have a different problem affecting a different block, for example a checksum violation on read. The Ozone client would catch the read error and retry the read from another replica. However, how much data can be recovered depends on the number and extent of the corruptions, and on whether the same blocks are corrupted in all replicas.
+
+#### Unhealthy Ratis
+
+A Ratis container with three replicas in the states Healthy, Healthy, Unhealthy is still fully readable and therefore recoverable, so it is marked as under-replicated, because the unhealthy replica needs to be replaced. A Ratis container with 3 Unhealthy replicas is marked as unhealthy. It is not missing, because replicas are available, and it is not under-replicated, because it has all 3 copies. A Ratis container with only 2 replicas, both unhealthy, is marked as both Unhealthy and Under-Replicated. The Replication Manager will attempt to make an additional copy of the unhealthy container to resolve the under-replication, but it cannot correct the unhealthy state without manual intervention, as there is no good copy to copy from.
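The Ratis examples above can be condensed into a few rules. The sketch below is a hypothetical classifier for a container with replication factor 3 that takes only counts of healthy and unhealthy replicas; it reproduces the cases described in this section and does not model over-replication or any of the other checks in the chain.

```java
import java.util.EnumSet;
import java.util.Set;

final class RatisHealthCheck {
  enum Health { MISSING, UNHEALTHY, UNDER_REPLICATED }

  static final int REPLICATION_FACTOR = 3;

  private RatisHealthCheck() { }

  static Set<Health> classify(int healthy, int unhealthy) {
    Set<Health> result = EnumSet.noneOf(Health.class);
    int total = healthy + unhealthy;
    if (total == 0) {
      result.add(Health.MISSING);   // no copies online at all
      return result;
    }
    if (healthy == 0) {
      result.add(Health.UNHEALTHY); // replicas exist, but no good copy
    }
    // Too few copies overall, or some good copies alongside unhealthy ones
    // that need replacing.
    if (total < REPLICATION_FACTOR
        || (healthy > 0 && healthy < REPLICATION_FACTOR)) {
      result.add(Health.UNDER_REPLICATED);
    }
    return result;
  }
}
```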
+
+#### Unhealthy EC
+
+EC containers behave similarly. An EC container is marked unhealthy only if it is not missing (at least “data number” of replicas are available) but fewer than “data number” of those replicas are healthy. The following table shows examples for an RS-3-2 container, which has 3 data and 2 parity replicas (indexes 1 to 5):
+
+| Index = 1 | Index = 2           | Index = 3 | Index = 4 | Index = 5 | State                                |
+| --------- | ------------------- | --------- | --------- | --------- | ------------------------------------ |
+| Healthy   | Healthy             | Healthy   | Unhealthy | Unhealthy | Under-Replicated                     |
+| Healthy   | Healthy             | Healthy   |           |           | Under-Replicated                     |
+| Healthy   | Healthy             | Unhealthy |           |           | Unhealthy                            |
+| Healthy   | Healthy             |           |           |           | Missing                              |
+| Healthy   | Healthy + Unhealthy |           |           |           | Missing and Over-Replicated          |
+| Healthy   | Healthy + Unhealthy | Healthy   |           |           | Under-Replicated and Over-Replicated |
+
+Note that an EC container can be both Unhealthy and Over-Replicated, as there may be two copies of the same index, one healthy and one unhealthy.
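The EC table can be reproduced by counting distinct replica indexes. In the hypothetical sketch below, a container is missing if fewer than `data` distinct indexes are present, unhealthy if enough indexes are present but fewer than `data` of them have a healthy copy, under-replicated if fewer than `data + parity` indexes have a healthy copy, and over-replicated if any index has more than one copy. These rules are inferred from the table above, not taken from the Ozone source.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class EcHealthCheck {
  enum Health { MISSING, UNHEALTHY, UNDER_REPLICATED, OVER_REPLICATED }

  // Hypothetical replica description: its EC index (1-based) and health.
  static final class Replica {
    final int index;
    final boolean healthy;

    Replica(int index, boolean healthy) {
      this.index = index;
      this.healthy = healthy;
    }
  }

  private EcHealthCheck() { }

  static Set<Health> classify(List<Replica> replicas, int data, int parity) {
    Map<Integer, Integer> copiesPerIndex = new HashMap<>();
    Set<Integer> healthyIndexes = new HashSet<>();
    for (Replica r : replicas) {
      copiesPerIndex.merge(r.index, 1, Integer::sum);
      if (r.healthy) {
        healthyIndexes.add(r.index);
      }
    }
    Set<Health> result = EnumSet.noneOf(Health.class);
    int available = copiesPerIndex.size(); // distinct indexes present
    if (available < data) {
      result.add(Health.MISSING);
    } else if (healthyIndexes.size() < data) {
      result.add(Health.UNHEALTHY);
    } else if (healthyIndexes.size() < data + parity) {
      result.add(Health.UNDER_REPLICATED);
    }
    // Two or more copies of the same index, healthy or not.
    for (int copies : copiesPerIndex.values()) {
      if (copies > 1) {
        result.add(Health.OVER_REPLICATED);
        break;
      }
    }
    return result;
  }
}
```

Calling `classify` with `data = 3` and `parity = 2` for each row of the table returns the state shown in its last column.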
+
+If an EC container is unhealthy, the Replication Manager cannot correct it without manual intervention, as unhealthy replicas cannot be used in reconstruction. Much of the data may still be readable, because an unhealthy container may have a problem with only a single block, but if an unhealthy EC container has genuine corruption, it is likely that at least some of its data is unreadable.
+
+#### Empty
+
+A container is marked as empty if it has been closed and all data blocks stored in it have since been removed. When this is detected, the container transitions from CLOSED to DELETING, so a container should only be reported as Empty until the next Replication Manager check stage.
+
+#### Open Unhealthy
+
+When a container is open, all of its replicas are expected to be in the same open state. If a problem causes a replica to leave the open state, the Replication Manager marks the container as Open Unhealthy and triggers the close process. Normally such a container will have transitioned to Closing or Closed by the next Replication Manager check stage.
+
+#### Quasi Closed Stuck
+
+This applies only to Ratis containers. When a container is in the Quasi Closed state, the Replication Manager must wait for a majority of its replicas (2 out of 3) to reach the Quasi Closed state before it can transition the container to closed. Until that happens, the container is marked as Quasi Closed Stuck.
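The quorum rule for quasi closed Ratis containers is a strict-majority test. The hypothetical sketch below only counts replicas, and the real check may consider more than a count; with integer division, `replicationFactor / 2` is 1 for a factor of 3, so at least 2 quasi closed replicas are needed.

```java
final class QuasiClosedCheck {
  private QuasiClosedCheck() { }

  // True when a strict majority of the expected replicas are quasi closed,
  // e.g. at least 2 of 3 for a Ratis container with replication factor 3.
  static boolean hasQuorum(int quasiClosedReplicas, int replicationFactor) {
    return quasiClosedReplicas > replicationFactor / 2;
  }

  // A quasi closed container without a quorum is reported as stuck.
  static boolean isStuck(int quasiClosedReplicas, int replicationFactor) {
    return !hasQuorum(quasiClosedReplicas, replicationFactor);
  }
}
```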
+
+#### Unhealthy Container Samples
+
+To help investigate problems with degraded containers, the report includes a sample of the first 100 container IDs in each state. Given these IDs, it is possible to see whether the same containers are continuously stuck, and to get more information about a container with the `ozone admin container info` command and its container ID.

Review Comment:
   ```suggestion
   To facilitate investigating problems with degraded containers, the report 
includes a sample of the first 100 container IDs in each state and includes 
them in the report. Given these IDs, it is possible to see if the same 
containers are continuously stuck, and also get more information about the 
container via the `ozone admin container info` command. See [Troubleshooting 
Containers](troubleshooting/storage-containers) for more information.
   ```


