szetszwo commented on code in PR #6121:
URL: https://github.com/apache/ozone/pull/6121#discussion_r1483549548


##########
hadoop-hdds/docs/content/design/container-reconciliation.md:
##########
@@ -0,0 +1,246 @@
+---
+title: Container Reconciliation
+summary: Allow Datanodes to reconcile mismatched container contents regardless 
of their state.
+date: 2024-01-29
+jira: HDDS-10239
+status: draft
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+
+# Container Reconciliation
+
+This document outlines the proposed recovery protocol for containers where one 
or more replicas are not cleanly closed or have potential data inconsistencies. 
It aims to provide an overview of the planned changes and their implications, 
focusing on the overall flow and key design decisions.
+
+## Background
+
+Ideally, a healthy Ozone cluster would contain only open and closed 
containers. However, container replicas commonly end up with a mix of states 
including quasi-closed and unhealthy that the current system is not able to 
resolve to cleanly closed replicas. The cause of these states is often bugs or 
broad failure handling on the write path. While we should fix these causes, 
they raise the problem that Ozone is not able to reconcile these mismatched 
container states on its own, regardless of their cause. This has led to 
significant complexity in the replication manager for how to handle cases where 
only quasi-closed and unhealthy replicas are available, especially in the case 
of decommissioning.
+
+Even when all replicas are closed, the system assumes that these closed 
container replicas are equal with no way to verify this. Checksumming is done 
for individual chunks within each container, but if two container replicas 
somehow end up with chunks that differ in length or content despite being 
marked closed with local checksums matching, the system has no way to detect or 
resolve this anomaly.
+
+This document proposes a container reconciliation protocol to solve these 
problems. After implementing the proposal:
+1. It should be possible for a cluster to progress to a state where it has 
only properly replicated closed and open containers.
+2. We can verify the equality and integrity of all closed containers.
+
+## Guiding Principles
+
+1. **User Focus**: Users prioritize data durability and availability above all 
else.
+   - From the user perspective, containers labelled quasi-closed and unhealthy 
represent compromised durability and availability, regardless of the 
container's actual contents.
+
+2. **Focus on Recovery Paths**: Focusing on the path to a failed state is 
secondary to focusing on the path out of failed states.
+    - For example, we should not focus on whether it is possible for two 
replicated closed containers with locally matching chunk checksums to have 
differing content, only on whether the system could detect and recover from 
this case if it were to happen.
+
+3. **System Safety**: If a decision made by software will make data more 
durable, a single trigger is sufficient. If a decision can potentially reduce 
durability of data or execute an unsafe operation (unlink, trim, delete), then 
the confidence level has to be high, the decision has to be precise and clear, 
and preferably the decision is made within services that have a wider 
view of the cluster (SCM/Recon).
+
+4. **Datanode Simplicity**: Datanodes should only be responsible for safe 
decisions and eager to make safe choices, avoiding unsafe autonomy.
+
+## Assumptions
+
+- A closed container will not accept new blocks from clients.

Review Comment:
   I agree that it is a long-term thing and not in the scope of this. However, 
we should NOT design something today that prevents the long-term direction. Let 
me ask a different question -- why do we need this assumption in this design?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

