kerneltime commented on code in PR #6121:
URL: https://github.com/apache/ozone/pull/6121#discussion_r1506264122


##########
hadoop-hdds/docs/content/design/container-reconciliation.md:
##########
@@ -0,0 +1,344 @@
+---
+title: Container Reconciliation
+summary: Allow Datanodes to reconcile mismatched container contents regardless 
of their state.
+date: 2024-01-29
+jira: HDDS-10239
+status: draft
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+
+# Container Reconciliation
+
+This document outlines the proposed recovery protocol for containers where one 
or more replicas are not cleanly closed or have potential data inconsistencies. 
It aims to provide an overview of the planned changes and their implications, 
focusing on the overall flow and key design decisions.
+
+## Nomenclature
+1. Container: A container is a logical unit of storage management in Ozone. It 
is a collection of blocks that are used to store data.
+2. Container Replica/Instance: A container replica is a copy of a container 
that is stored on a Datanode or a shard of an Erasure Coded Container.
+3. Block: A block is a collection of chunks that are used to store data. An 
Ozone object consists of one or more blocks.
+4. Chunk: A chunk is a collection of bytes that are used to store data. A 
chunk is the smallest unit of read and write in Ozone.
+
+## Background
+
+This proposal is motivated by the need to reconcile mismatched container 
replica states and contents among container replicas.
+This covers
+1. Container replicas that are not cleanly closed.
+2. Container replicas that have potential data inconsistencies due to bugs or 
broad failure handling on the write path.
+3. Silent data corruption that may occur in the system.
+4. The need to verify the equality and integrity of all closed container 
replicas.
+5. Deleted blocks within a container that still exist in some container 
replicas.
+6. The need to simplify the replication manager for how to handle cases where 
only quasi-closed and unhealthy container replicas are available.
+
+Ideally, a healthy Ozone cluster would contain only open and closed container 
replicas. However, container replicas commonly end up with a mix of states 
including quasi-closed and unhealthy that the current system is not able to 
resolve to cleanly closed replicas. The cause of these states is often bugs or 
broad failure handling on the write path. While we should fix these causes, 
they raise the problem that Ozone is not able to reconcile these mismatched 
container replica states on its own, regardless of their cause. This has led 
to significant complexity in the replication manager for how to handle cases 
where only quasi-closed and unhealthy replicas are available, especially in the 
case of decommissioning.
+
+Even when all container replicas are closed, the system assumes that these 
closed container replicas are equal with no way to verify this. During writes a 
client provides a checksum for the chunk that is written. 
+The scanner validates periodically that the checksums of the chunks on disk 
match the checksums provided by the client. It is possible that the checksum of 
a chunk on disk does not match the client provided checksum recorded at the 
time of write. Additionally, during container replica copying, the consistency 
of the data is not validated, opening the possibility of silent data corruption 
propagating through the system.
+
+This document proposes a container reconciliation protocol to solve these 
problems. After implementing the proposal:
+1. It should be possible for a cluster to progress to a state where it has 
only properly replicated closed and open containers.

Review Comment:
   ```suggestion
   1. It should be possible for a cluster to progress to a state where all 
non-open containers are closed and meet the desired replication factor.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
