szetszwo commented on code in PR #6121:
URL: https://github.com/apache/ozone/pull/6121#discussion_r1483549548

##########
hadoop-hdds/docs/content/design/container-reconciliation.md:
##########
@@ -0,0 +1,246 @@
+---
+title: Container Reconciliation
+summary: Allow Datanodes to reconcile mismatched container contents regardless of their state.
+date: 2024-01-29
+jira: HDDS-10239
+status: draft
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+
+# Container Reconciliation
+
+This document outlines the proposed recovery protocol for containers where one or more replicas are not cleanly closed or have potential data inconsistencies. It aims to provide an overview of the planned changes and their implications, focusing on the overall flow and key design decisions.
+
+## Background
+
+Ideally, a healthy Ozone cluster would contain only open and closed containers. However, container replicas commonly end up with a mix of states, including quasi-closed and unhealthy, that the current system is not able to resolve to cleanly closed replicas. The cause of these states is often bugs or broad failure handling on the write path. While we should fix these causes, they raise the problem that Ozone is not able to reconcile these mismatched container states on its own, regardless of their cause. This has led to significant complexity in the replication manager for how to handle cases where only quasi-closed and unhealthy replicas are available, especially in the case of decommissioning.
+
+Even when all replicas are closed, the system assumes that these closed container replicas are equal, with no way to verify this. Checksumming is done for individual chunks within each container, but if two container replicas somehow end up with chunks that differ in length or content despite being marked closed with local checksums matching, the system has no way to detect or resolve this anomaly.
+
+This document proposes a container reconciliation protocol to solve these problems. After implementing the proposal:
+1. It should be possible for a cluster to progress to a state where it has only properly replicated closed and open containers.
+2. We can verify the equality and integrity of all closed containers.
+
+## Guiding Principles
+
+1. **User Focus**: Users prioritize data durability and availability above all else.
+   - From the user perspective, containers labelled quasi-closed and unhealthy represent compromised durability and availability, regardless of the container's actual contents.
+
+2. **Focus on Recovery Paths**: Focusing on the path to a failed state is secondary to focusing on the path out of failed states.
+   - For example, we should not focus on whether it is possible for two replicated closed containers with locally matching chunk checksums to have differing content, only on whether the system could detect and recover from this case if it were to happen.
+
+3. **System Safety**: If a decision made by software will make data more durable, a single trigger is sufficient. If a decision can potentially reduce the durability of data or execute an unsafe operation (unlink, trim, delete), then the confidence level has to be high, the decision has to be precise and clear, and preferably the decision is made within services that have a wider view of the cluster (SCM/Recon).
+
+4. **Datanode Simplicity**: Datanodes should only be responsible for safe decisions and eager to make safe choices, avoiding unsafe autonomy.
+
+## Assumptions
+
+- A closed container will not accept new blocks from clients.

Review Comment:
I agree that it is a long-term thing and not in the scope of this. However, we should NOT design something today that prevents the long-term direction. Let me ask a different question: why do we need this assumption in this design?
