ChenSammi commented on code in PR #6121:
URL: https://github.com/apache/ozone/pull/6121#discussion_r1511112637


##########
hadoop-hdds/docs/content/design/container-reconciliation.md:
##########
@@ -0,0 +1,344 @@
+---
+title: Container Reconciliation
+summary: Allow Datanodes to reconcile mismatched container contents regardless 
of their state.
+date: 2024-01-29
+jira: HDDS-10239
+status: draft
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+
+# Container Reconciliation
+
+This document outlines the proposed recovery protocol for containers where one 
or more replicas are not cleanly closed or have potential data inconsistencies. 
It aims to provide an overview of the planned changes and their implications, 
focusing on the overall flow and key design decisions.
+
+## Nomenclature
+1. Container: A container is a logical unit of storage management in Ozone. It 
is a collection of blocks that are used to store data.
+2. Container Replica/Instance: A container replica is a copy of a container stored on a Datanode, or a shard of an erasure coded container.
+3. Block: A block is a collection of chunks that are used to store data. An 
Ozone object consists of one or more blocks.
+4. Chunk: A chunk is a collection of bytes that are used to store data. A 
chunk is the smallest unit of read and write in Ozone.
+
+## Background
+
+This proposal is motivated by the need to reconcile mismatched states and contents among container replicas.
+This covers:
+1. Container replicas that are not cleanly closed.
+2. Container replicas that have potential data inconsistencies due to bugs or broad failure handling on the write path.
+3. Silent data corruption that may occur in the system.
+4. The need to verify the equality and integrity of all closed container replicas.
+5. Deleted blocks within a container that still exist in some container replicas.
+6. The need to simplify how the replication manager handles cases where only quasi-closed and unhealthy container replicas are available.
+
+Ideally, a healthy Ozone cluster would contain only open and closed container replicas. However, container replicas commonly end up in a mix of states, including quasi-closed and unhealthy, that the current system is not able to resolve to cleanly closed replicas. The cause of these states is often bugs or broad failure handling on the write path. While we should fix these causes, they raise the problem that Ozone is not able to reconcile these mismatched container replica states on its own, regardless of their cause. This has led to significant complexity in the replication manager when only quasi-closed and unhealthy replicas are available, especially in the case of decommissioning.
+
+Even when all container replicas are closed, the system assumes that these closed container replicas are equal, with no way to verify this. During writes, a client provides a checksum for each chunk that is written.
+The scanner periodically validates that the checksums of the chunks on disk match the checksums provided by the client. It is possible that the checksum of a chunk on disk does not match the client-provided checksum recorded at the time of write. Additionally, during container replica copying, the consistency of the data is not validated, opening the possibility of silent data corruption propagating through the system.
+
+This document proposes a container reconciliation protocol to solve these problems. After implementing the proposal:
+1. It should be possible for a cluster to progress to a state where all non-open containers are closed and meet the desired replication factor.
+2. We can verify the equality and integrity of all closed containers.
+
+Note: This document does not cover the case where the checksums recorded at the time of write match the chunks locally within a Datanode but differ across replicas. We assume that the replication code path is correct and that the checksums are correct. If this is not the case, the system is already in a failed state and the reconciliation protocol will not be able to recover it. Chunks are not updated once written, so this scenario is not expected to occur.
+
+## Guiding Principles
+
+1. **User Focus**: Users prioritize data durability and availability above all 
else.
+   - From the user perspective, containers labelled quasi-closed and unhealthy 
represent compromised durability and availability, regardless of the 
container's actual contents.
+
+2. **Focus on Recovery Paths**: Focusing on the path to a failed state is 
secondary to focusing on the path out of failed states.
+    - For example, we should not focus on whether it is possible for two 
replicated closed containers to have differing content, only on whether the 
system could detect and recover from this case if it were to happen.
+
+3. **System Safety**: If a decision made by software will make data more durable, a single trigger is sufficient. If a decision can potentially reduce the durability of data or execute an unsafe operation (unlink, trim, delete), then the confidence level has to be high, the decision has to be precise and clear, and preferably it is made within services that have a wider view of the cluster (SCM/Recon).
+
+4. **Datanode Simplicity**: Datanodes should be responsible only for safe decisions, eagerly making safe choices and avoiding unsafe autonomy.
+
+## Assumptions
+
+1. A closed container will not accept new blocks from clients.
+2. Empty containers are excluded from this proposal. Empty containers create unnecessary noise and overhead in the system but are not relevant to the durability of existing data.
+3. If chunk checksums match locally, they should match across replicas.
+4. The longest block is always the correct block to preserve at the Datanode level, based on the limited information a single Datanode has. Whether the data within a block is ever accessed by the client depends on the consistency between the Datanode and the Ozone Manager, which is not transactional and can vary. The safe decision is to preserve the longest block and let an external entity that processes cluster-wide usage of blocks decide whether the block can be trimmed or deleted. The longest block may contain uncommitted chunks, but the datanodes have no way to verify this and must be defensive about preserving their data.
+
+## Solution Proposal
+
+The proposed solution involves defining a container level checksum that can be used to quickly tell whether two container replicas match based on their data. This container checksum can be defined as a three level Merkle tree (a rollup sketch follows the list below):
+
+1. Level 1 (leaves): The existing chunk level checksums (written by the client 
and verified by the existing datanode container scanner).
+2. Level 2: A block level checksum created by hashing all the chunk checksums 
within the block.
+3. Level 3 (root): A container level checksum created by hashing all the block 
checksums within the container.
+   1. This top level container hash is what is reported to SCM to detect 
diverged replicas. SCM does not need block or chunk level hashes.
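+
+For illustration only, the rollup could look like the following minimal sketch. The hash function (SHA-256) and the class and method names here are assumptions for readability, not the final implementation, which would work with the existing `ChecksumData` protos:
+
+```java
+import java.security.MessageDigest;
+import java.security.NoSuchAlgorithmException;
+import java.util.ArrayList;
+import java.util.List;
+
+// Illustrative sketch: rolls client-provided chunk checksums up into block and
+// container level checksums. SHA-256 and all names here are assumptions.
+public class ContainerChecksumSketch {
+
+  private static byte[] hashAll(List<byte[]> checksums) throws NoSuchAlgorithmException {
+    MessageDigest digest = MessageDigest.getInstance("SHA-256");
+    for (byte[] checksum : checksums) {
+      digest.update(checksum); // Order matters: chunks and blocks are sorted by write order.
+    }
+    return digest.digest();
+  }
+
+  // Level 2: block checksum = hash of the block's chunk checksums (the level 1 leaves).
+  public static byte[] blockChecksum(List<byte[]> chunkChecksums) throws NoSuchAlgorithmException {
+    return hashAll(chunkChecksums);
+  }
+
+  // Level 3 (root): container checksum = hash of the block checksums.
+  // Only this root value needs to be reported to SCM.
+  public static byte[] containerChecksum(List<List<byte[]>> chunkChecksumsPerBlock)
+      throws NoSuchAlgorithmException {
+    List<byte[]> blockChecksums = new ArrayList<>();
+    for (List<byte[]> block : chunkChecksumsPerBlock) {
+      blockChecksums.add(blockChecksum(block));
+    }
+    return hashAll(blockChecksums);
+  }
+}
+```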
+
+When SCM sees that replicas of a non-open container have diverged container checksums, it can trigger a reconciliation process on all datanode replicas. SCM does not need to know which container hash is correct (if any of them are correct), only that all replicas match. Datanodes will use their merkle tree and those of the other replicas to identify issues with their container. Next, datanodes can read the missing data from existing replicas and use it to repair their container replica.
+
+Since the container hash is generated from the checksums recorded at the time of writing, it represents the consistency of the data from a client perspective.
+
+### Phase I (outlined in this document)
+
+1. Add container level checksums that datanodes can compute and store.
+
+2. Add a mechanism for datanodes to reconcile their replica of a container 
with another datanode's replica so that both replicas can be verified to be 
equal at the end of the process.
+
+3. Add a mechanism for SCM to trigger this reconciliation as part of the 
existing heartbeat command protocol SCM uses to communicate with datanodes.
+
+4. An `ozone admin container reconcile <container-id>` CLI that can be used to 
manually resolve diverged container states among non-open container replicas.
+    - When SCM receives this command, it would trigger one reconcile command for each replica. The CLI would be asynchronous, so progress could be checked using the container level checksum info added to the `ozone admin container info` output.
+
+5. Delete blocks that a Container Replica has not yet deleted.
+
+### Phase II (out of scope for this document)
+
+- Automate container reconciliation requests as part of SCM's replication 
manager.
+
+- Simplify the SCM replication manager's decommission and recovery logic to be based on container checksum mismatches instead of the combination of all possible container states.
+
+### Phase III (out of scope for this document)
+
+- Extend container level checksum verification to erasure coded containers.
+    - EC container replicas do not have the same data and are not expected to 
have matching container checksums.
+    - EC containers already use offline recovery as a reconciliation mechanism.
+
+
+## Solution Implementation
+
+### Container Hash Tree / Merkle Tree
+
+The only extra information we will need to store is the container merkle tree on each datanode container replica. The current proposal is to store this separately as a proto file on disk so that it can be copied over the network exactly as stored. The structure would look something like this (not finalized, for illustrative purposes only):
+
+```diff
+diff --git 
a/hadoop-hdds/interface-client/src/main/proto/DatanodeClientProtocol.proto 
b/hadoop-hdds/interface-client/src/main/proto/DatanodeClientProtocol.proto
+index 718e2a108c7..d8d508af356 100644
+--- a/hadoop-hdds/interface-client/src/main/proto/DatanodeClientProtocol.proto
++++ b/hadoop-hdds/interface-client/src/main/proto/DatanodeClientProtocol.proto
+@@ -382,6 +382,7 @@ message ChunkInfo {
+   repeated KeyValue metadata = 4;
+   required ChecksumData checksumData =5;
+   optional bytes stripeChecksum = 6;
++  optional bool healthy = 7; // If all the chunks on disk match the expected 
checksums provided by the client during write
+ }
+ 
+ message ChunkInfoList {
+@@ -525,3 +526,38 @@ service IntraDatanodeProtocolService {
+   rpc download (CopyContainerRequestProto) returns (stream 
CopyContainerResponseProto);
+   rpc upload (stream SendContainerRequest) returns (SendContainerResponse);
+ }
++
++/*
++BlockMerkle tree stores the checksums of the chunks in a block.
++The Block checksum is derived from the checksums of the chunks in case of 
replicated blocks and derived from the
++metadata of the chunks in case of erasure coding.
++Two Blocks across container instances on two nodes have the same checksum if 
they have the same set of chunks.
++A Block upon deletion will be marked as deleted but will preserve the rest of 
the metadata.
++*/
++message BlockMerkleTree {
++  optional BlockData blockData = 1; // The chunks in this should be sorted by 
the order of chunks written.
++  optional ChecksumData checksumData = 2; // Checksum of the checksums of the 
chunks.
++  optional bool deleted = 3; // If the block is deleted.
++  optional int64 length = 4; // Length of the block.
++  optional int64 chunkCount = 5; // Number of chunks in the block.
++}
++
++/*
++ContainerMerkleTree stores the checksums of the blocks in a container.
++The Container checksum is derived from the checksums of the blocks.
++Two container instances on two nodes have the same checksum if they have the same set of blocks.
++If a block is deleted within the container, the checksum of the container 
will remain unchanged.
++ */
++message ContainerMerkleTree {
++  enum FailureCause {
++    NO_HEALTHY_CHUNK_FOUND_WITH_PEERS = 1; // No healthy chunk found with 
peers.
++    NO_PEER_FOUND = 2; // No peer found.
++  }
++  optional int64 containerID = 1; // The container ID.
++  repeated BlockMerkleTree blockMerkleTrees = 2; // The blocks in this should 
be sorted by the order of blocks written.
++  optional ChecksumData checksumData = 3; // Checksum of the checksums of the 
blocks.
++  optional int64 length = 5; // Length of the container.
++  optional int64 blockCount = 6; // Number of blocks in the container.
++  optional FailureCause failureCause = 7; // The cause of the failure.
++  optional int64 reconciliationCount = 8; // The reconciliation count.
++}
+
+
+```
+This is written as a file to avoid bloating the RocksDB instance, which is in the IO path.
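+
+As a rough sketch of how the file could be persisted, the serialized tree might be written atomically next to the container metadata so it can be streamed to peers exactly as stored. The file name, directory layout, and class names below are placeholders, not a decided design:
+
+```java
+import java.io.IOException;
+import java.io.OutputStream;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.StandardCopyOption;
+
+// Illustrative sketch: persists the serialized ContainerMerkleTree proto as a
+// plain file. "container.checksum" and the metadata directory are placeholders.
+public class ContainerChecksumFile {
+
+  public static void write(Path containerMetadataDir, byte[] serializedTree) throws IOException {
+    Path file = containerMetadataDir.resolve("container.checksum");
+    Path tmp = containerMetadataDir.resolve("container.checksum.tmp");
+    // Write to a temp file, then rename, so readers never observe a partial tree.
+    try (OutputStream out = Files.newOutputStream(tmp)) {
+      out.write(serializedTree);
+    }
+    Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
+  }
+
+  public static byte[] read(Path containerMetadataDir) throws IOException {
+    return Files.readAllBytes(containerMetadataDir.resolve("container.checksum"));
+  }
+}
+```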
+
+## APIs
+
+The following APIs would be added to datanodes to support container reconciliation. The actions performed when calling them are defined in [Events](#events). A sketch of the shapes of these APIs follows the list.
+
+- `reconcileContainer(containerID, List<Replica>)`
+    - Instructs a datanode to reconcile its copy of the specified container 
with the provided list of other container replicas.
+    - Datanodes will call `getContainerHashes` for each container replica to identify the repairs needed, and use existing chunk read/write APIs to make the necessary repairs.
+    - This would likely be a new command as part of the SCM heartbeat 
protocol, not actually a new API.
+- `getContainerHashes(containerID)`
+    - A datanode API that returns the merkle tree for a given container. The 
proto structure would be similar to that outlined in [Merkle 
Tree](#Container-Hash-Tree-/-Merkle-Tree).
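+
+A minimal sketch of what these surfaces could look like, assuming Java and placeholder types; in practice `reconcileContainer` would be a command on the SCM heartbeat protocol and `getContainerHashes` an RPC on the intra-datanode service, both defined in protobuf rather than as a Java interface:
+
+```java
+import java.util.List;
+
+// Illustrative sketch of the two additions described above. Peer replicas are
+// identified here by plain strings; the real protocol would use DatanodeDetails
+// and protobuf messages.
+public interface ContainerReconciliationApi {
+
+  // Reconcile the local replica of containerId against the given peer replicas.
+  // Delivered as a new command on the existing SCM heartbeat protocol.
+  void reconcileContainer(long containerId, List<String> peerDatanodes);
+
+  // Return the serialized ContainerMerkleTree for the given container so a peer
+  // can diff it against its own tree.
+  byte[] getContainerHashes(long containerId);
+}
+```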
+
+## Reconciliation Process
+
+SCM: Storage Container Manager
+
+DN: Datanode
+
+### SCM sets up the reconciliation process as follows:
+
+1. SCM triggers a reconciliation on `DN 1` for container 12 with replicas on 
`DN 2` and `DN 3`.
+   1. `SCM -> reconcileContainer(Container #12, DN2, DN3) -> DN 1` 
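+
+For example, the fan-out on the SCM side could look roughly like the sketch below, where every replica holder is asked to reconcile against the others. The class and method names are assumptions for illustration only:
+
+```java
+import java.util.Arrays;
+import java.util.List;
+
+// Illustrative sketch: SCM queues a reconcile command for each replica of
+// container 12 so that every copy converges on the same content.
+public class ReconcileDispatchSketch {
+
+  static class ReconcileContainerCommand {
+    final long containerId;
+    final List<String> peerDatanodes;
+
+    ReconcileContainerCommand(long containerId, List<String> peerDatanodes) {
+      this.containerId = containerId;
+      this.peerDatanodes = peerDatanodes;
+    }
+  }
+
+  interface CommandQueue {
+    // Commands are delivered to the datanode on its next heartbeat.
+    void addCommand(String datanodeId, ReconcileContainerCommand command);
+  }
+
+  static void dispatch(CommandQueue queue) {
+    long containerId = 12;
+    queue.addCommand("DN1", new ReconcileContainerCommand(containerId, Arrays.asList("DN2", "DN3")));
+    queue.addCommand("DN2", new ReconcileContainerCommand(containerId, Arrays.asList("DN1", "DN3")));
+    queue.addCommand("DN3", new ReconcileContainerCommand(containerId, Arrays.asList("DN1", "DN2")));
+  }
+}
+```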
+
+### Datanodes set up the reconciliation process as follows:
+
+1. `DN 1` schedules the Merkle Tree to be calculated if it is not already present and reports it to SCM.
+   1. SCM can then compare the container replica hashes and schedule reconciliation if they are different.
+2. If `DN 1` already has the Merkle Tree locally, it will compare it with the 
Merkle Trees of the other container replicas and schedule reconciliation if 
they are different. Example: 
+   1. `DN 1 -> getContainerHashes(Container #12) -> DN 2` // Datanode 1 gets 
the merkle tree of container 12 from Datanode 2.
+   2. `DN 1 -> getContainerHashes(Container #12) -> DN 3` // Datanode 1 gets 
the merkle tree of container 12 from Datanode 3.
+   3. ... // Continue for all replicas.
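+
+A minimal sketch of how `DN 1` might diff its own tree against a peer's tree to find blocks that need repair, assuming block checksums are keyed by block ID (all names here are illustrative):
+
+```java
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+
+// Illustrative sketch: compares block level checksums from the local Merkle
+// tree with those of a peer replica. Missing or mismatched blocks are the
+// candidates for repair via the existing chunk read/write APIs.
+public class MerkleTreeDiffSketch {
+
+  public static List<Long> blocksNeedingRepair(Map<Long, byte[]> localBlockChecksums,
+      Map<Long, byte[]> peerBlockChecksums) {
+    List<Long> needsRepair = new ArrayList<>();
+    for (Map.Entry<Long, byte[]> peerEntry : peerBlockChecksums.entrySet()) {
+      byte[] localChecksum = localBlockChecksums.get(peerEntry.getKey());
+      if (localChecksum == null) {
+        // Block exists on the peer but not locally: a missing block.
+        needsRepair.add(peerEntry.getKey());
+      } else if (!Arrays.equals(localChecksum, peerEntry.getValue())) {
+        // Block exists on both but the chunk checksums differ: a corrupt or
+        // divergent block; chunk level comparison narrows down the repair.
+        needsRepair.add(peerEntry.getKey());
+      }
+    }
+    return needsRepair;
+  }
+}
+```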
+
+### Reconcile loop once the merkle trees are obtained from all/most replicas:
+
+1. `DN 1` checks if any blocks are missing. For each missing Block:

Review Comment:
   What are the criteria to determine whether a block is missing, or merely an extra block that is not needed? The question is: what is the rule to decide which replica is the baseline replica?

   If any new blocks are added locally, the container metadata in RocksDB needs to be updated accordingly.


