kerneltime commented on code in PR #6121: URL: https://github.com/apache/ozone/pull/6121#discussion_r1488642693
########## hadoop-hdds/docs/content/design/container-reconciliation.md: ########## @@ -0,0 +1,246 @@ +--- +title: Container Reconciliation +summary: Allow Datanodes to reconcile mismatched container contents regardless of their state. +date: 2024-01-29 +jira: HDDS-10239 +status: draft +--- +<!-- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. +--> + + +# Container Reconciliation + +This document outlines the proposed recovery protocol for containers where one or more replicas are not cleanly closed or have potential data inconsistencies. It aims to provide an overview of the planned changes and their implications, focusing on the overall flow and key design decisions. + +## Background + +Ideally, a healthy Ozone cluster would contain only open and closed containers. However, container replicas commonly end up with a mix of states, including quasi-closed and unhealthy, that the current system is not able to resolve to cleanly closed replicas. The cause of these states is often bugs or broad failure handling on the write path. While we should fix these causes, they raise the problem that Ozone is not able to reconcile these mismatched container states on its own, regardless of their cause. This has led to significant complexity in the replication manager for how to handle cases where only quasi-closed and unhealthy replicas are available, especially in the case of decommissioning.
+ +Even when all replicas are closed, the system assumes that these closed container replicas are equal, with no way to verify this. Checksumming is done for individual chunks within each container, but if two container replicas somehow end up with chunks that differ in length or content despite being marked closed with local checksums matching, the system has no way to detect or resolve this anomaly. + +This document proposes a container reconciliation protocol to solve these problems. After implementing the proposal: +1. It should be possible for a cluster to progress to a state where it has only properly replicated closed and open containers. +2. We can verify the equality and integrity of all closed containers. + +## Guiding Principles + +1. **User Focus**: Users prioritize data durability and availability above all else. + - From the user perspective, containers labelled quasi-closed and unhealthy represent compromised durability and availability, regardless of the container's actual contents. + +2. **Focus on Recovery Paths**: Focusing on the path to a failed state is secondary to focusing on the path out of failed states. + - For example, we should not focus on whether it is possible for two replicated closed containers with locally matching chunk checksums to have differing content, only on whether the system could detect and recover from this case if it were to happen. + +3. **System Safety**: If a decision made by software will make data more durable, a single trigger is sufficient. If a decision can potentially reduce the durability of data or execute an unsafe operation (unlink, trim, delete), then the confidence level has to be high, the decision must be precise and clear, and preferably the decision is made within services that have a wider view of the cluster (SCM/Recon). + +4. **Datanode Simplicity**: Datanodes should be responsible only for safe decisions and eager to make safe choices, avoiding unsafe autonomy.
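To make the verification gap from the Background concrete, here is a toy sketch (hypothetical data and helper names, not Ozone code) of why each replica's local chunk-checksum scan can pass even when replica contents have silently diverged:

```python
import zlib

# Hypothetical chunk contents for the same block on two replicas.
replica_a_chunks = [b"hello", b"world"]
replica_b_chunks = [b"hello", b"w0rld"]  # second chunk silently diverged

def local_scan_passes(chunks, stored_checksums):
    # The existing scanner only compares each chunk against the checksum
    # stored alongside it on the same datanode.
    return all(zlib.crc32(c) == s for c, s in zip(chunks, stored_checksums))

# Each replica stores checksums computed from its own (possibly divergent)
# data, so both local scans pass even though the replicas are not equal.
a_passes = local_scan_passes(replica_a_chunks, [zlib.crc32(c) for c in replica_a_chunks])
b_passes = local_scan_passes(replica_b_chunks, [zlib.crc32(c) for c in replica_b_chunks])
```

Nothing in the current system compares these checksums across replicas, which is the gap the container-level checksum below is meant to close.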
+ +## Assumptions + +- A closed container will not accept new blocks from clients. + +- Empty containers are excluded from this proposal. Empty containers create unnecessary noise and overhead in the system but are not relevant to the durability of existing data. + +- The longest block is always the correct block to preserve at the Datanode level based on the limited information a single Datanode has about the higher level business logic that is storing data. + + - The longest block may contain uncommitted chunks, but the datanodes have no way to verify this and must be defensive about preserving their data. + +## Solution Proposal + +The proposed solution involves defining a container level checksum that can be used to quickly tell whether two container replicas match based on their data. This container checksum can be defined as a three level Merkle tree: + +- Level 1 (leaves): The existing chunk level checksums (written by the client and verified by the existing datanode container scanner). +- Level 2: A block level checksum created by hashing all the chunk checksums within the block. +- Level 3 (root): A container level checksum created by hashing all the block checksums within the container. + - This top level container hash is what is reported to SCM to detect diverged replicas. SCM does not need block or chunk level hashes. + +When SCM sees that replicas of a non-open container have diverged container checksums, it can trigger a reconciliation process on all datanode replicas. SCM does not need to know which container hash is correct (if any of them are correct), only that all replicas match. Datanodes will use their Merkle tree and those of the other replicas to identify issues with their container. Next, datanodes can read the missing data from existing replicas and use it to repair their container replica. + +### Phase I (outlined in this document) + +- Add container level checksums that datanodes can compute and store.
+ +- Add a mechanism for datanodes to reconcile their replica of a container with another datanode's replica so that both replicas can be verified to be equal at the end of the process. + +- Add a mechanism for SCM to trigger this reconciliation as part of the existing heartbeat command protocol SCM uses to communicate with datanodes. + +- Add an `ozone admin container reconcile <container-id>` CLI that can be used to manually resolve diverged container states among non-open container replicas. + +### Phase II (out of scope for this document) + +- Automate container reconciliation requests as part of SCM's replication manager. + +- Simplify SCM replication manager decommission and recovery logic based on mismatch of container checksums, instead of the combination of all possible container states. + +### Phase III (out of scope for this document) + +- Extend container level checksum verification to erasure coded containers. + - EC container replicas do not have the same data and are not expected to have matching container checksums. + - EC containers already use offline recovery as a reconciliation mechanism. + +## Solution Implementation + +### Storage + +The only extra information we will need to store is the container Merkle tree on each datanode container replica. The current proposal is to store this separately as a proto file on disk so that it can be copied over the network exactly as stored. The structure would look something like this (not finalized, for illustrative purposes only): + +``` +ContainerChecksum: { + algorithm: CRC32 + checksum: 12345 + repeated BlockChecksum +} +BlockChecksum: { + algorithm: CRC32 + checksum: 12345 + length: 5 + deleted: false Review Comment: Not sure I follow the recommendation for how to track deletions beyond adding a flag to indicate it is deleted or how the guarantees would be more robust. -- This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
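For illustration of the design discussed above, the three-level checksum tree and the replica diff it enables could be sketched as follows. This is a minimal sketch under stated assumptions, not the Ozone implementation: all type and function names are hypothetical, and CRC32 is used only because the proto illustration in the doc mentions it.

```python
import zlib
from dataclasses import dataclass

@dataclass
class BlockTree:
    """Hypothetical per-block node of the container Merkle tree."""
    block_id: int
    chunk_checksums: list  # Level 1 (leaves): existing chunk-level checksums

    def checksum(self):
        # Level 2: block checksum over the block's chunk checksums.
        data = b"".join(c.to_bytes(4, "big") for c in self.chunk_checksums)
        return zlib.crc32(data)

def container_checksum(blocks):
    # Level 3 (root): container checksum over the block checksums.
    # This is the only value a datanode would report to SCM.
    data = b"".join(b.checksum().to_bytes(4, "big") for b in blocks)
    return zlib.crc32(data)

def diverged_blocks(mine, peer):
    # During reconciliation, a datanode compares its tree against a peer
    # replica's tree to find which blocks need repair; SCM never needs
    # this block-level detail, only the roots.
    peer_by_id = {b.block_id: b for b in peer}
    return [b.block_id for b in mine
            if b.block_id in peer_by_id
            and b.checksum() != peer_by_id[b.block_id].checksum()]
```

With this shape, divergence detection at SCM is a single integer comparison of roots, while the datanodes descend the tree only when roots differ.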
