kerneltime commented on code in PR #6121:
URL: https://github.com/apache/ozone/pull/6121#discussion_r1506264122

##########
hadoop-hdds/docs/content/design/container-reconciliation.md:
##########
@@ -0,0 +1,344 @@
+---
+title: Container Reconciliation
+summary: Allow Datanodes to reconcile mismatched container contents regardless of their state.
+date: 2024-01-29
+jira: HDDS-10239
+status: draft
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+
+# Container Reconciliation
+
+This document outlines the proposed recovery protocol for containers where one or more replicas are not cleanly closed or have potential data inconsistencies. It aims to provide an overview of the planned changes and their implications, focusing on the overall flow and key design decisions.
+
+## Nomenclature
+1. Container: A container is a logical unit of storage management in Ozone. It is a collection of blocks that are used to store data.
+2. Container Replica/Instance: A container replica is a copy of a container that is stored on a Datanode or a shard of an Erasure Coded Container.
+3. Block: A block is a collection of chunks that are used to store data. An Ozone object consists of one or more blocks.
+4. Chunk: A chunk is a collection of bytes that are used to store data. A chunk is the smallest unit of read and write in Ozone.
+
+## Background
+
+This proposal is motivated by the need to reconcile mismatched container replica states and contents among container replicas.
+This covers
+1. Containers replicas that are not cleanly closed.
+2. Containers replicas that have potential data inconsistencies due to bugs or broad failure handling on the write path.
+3. Silent data corruption that may occur in the system.
+4. The need to verify the equality and integrity of all closed containers replicas.
+5. Deleted blocks within a container that still exists in some container replicas.
+6. The need to simplify the replication manager for how to handle cases where only quasi-closed and unhealthy container replicas are available.
+
+Ideally, a healthy Ozone cluster would contain only open and closed container replicas. However, container replicas commonly end up with a mix of states including quasi-closed and unhealthy that the current system is not able to resolve to cleanly closed replicas. The cause of these states is often bugs or broad failure handling on the write path. While we should fix these causes, they raise the problem that Ozone is not able to reconcile these mismatched container replica states on its own, regardless of their cause. This has lead to significant complexity in the replication manager for how to handle cases where only quasi-closed and unhealthy replicas are available, especially in the case of decommissioning.
+
+Even when all container replicas are closed, the system assumes that these closed container replicas are equal with no way to verify this. During writes a client provides a checksum for the chunk that is written.
+The scanner validates periodically that the checksums of the chunks on disk match the checksums provided by the client. It is possible that the checksum of a chunk on disk does not match the client provided checksum recorded at the time of write. Additionally, during container replica copying, the consistency of the data is not validated, opening the possibility of silent data corruption propagating through the system.
+
+This document proposes a container reconciliation protocol to solve these problems. After implementing the proposal:
+1. It should be possible for a cluster to progress to a state where it has only properly replicated closed and open containers.

Review Comment:
```suggestion
1. It should be possible for a cluster to progress to a state where all not open containers are closed and meeting the desired replication factor.
```
