errose28 commented on code in PR #9913: URL: https://github.com/apache/ozone/pull/9913#discussion_r2978134986
##########
hadoop-hdds/docs/content/design/incremental-container-replication.md:
##########
@@ -0,0 +1,129 @@
+---
+title: Incremental Container Replication
+summary: Allow Datanodes to catch up with missing data in containers via incremental replication between the higher sequence ID to lower sequence ID.
+date: 2026-03-13
+jira: HDDS-14794
+status: draft
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# Incremental Container Replication
+
+## 1. Introduction
+
+Ozone currently handles container replication by transferring the entire container (RocksDB plus all chunk files) as a single compressed tarball.
+While this approach guarantees that the target Datanode receives an exact copy of the source container, it is highly inefficient for scenarios where the target Datanode already possesses a slightly older version of the container (e.g., due to a temporary network partition, node reboot, or a Ratis follower falling behind).
+
+This document proposes an **Incremental Container Replication** mechanism. If two container replicas have two different sequence IDs, we can setup incremental replication between the higher sequence ID to lower sequence ID instead of doing full container replication.
+By leveraging the monotonically increasing `BlockCommitSequenceId` of containers, a Datanode with a stale container replica can request only the delta (new blocks, chunks, and tombstones) from a fully up-to-date source Datanode, rather than re-downloading the entire container.
+
+## 2. Background and Motivation
+
+When a container replica falls behind (its `BlockCommitSequenceId` is lower than the sequence ID on other replicas), SCM currently handles this by considering the stale replica as invalid. SCM will typically schedule a full `ReplicateContainerCommand`.
+The target Datanode downloads the full container tarball, replacing its local copy entirely.
+
+This leads to several issues:
+1. **Network Waste**: A 5GB container might only be missing 5MB of recently appended data. Transferring 5GB is a 1000x overhead.
+2. **Disk I/O and Write Amplification**: Re-writing 5GB of identical data wastes Disk I/O, which is especially detrimental to SSD longevity and cluster performance.
+3. **Recovery Time**: In massive cluster events (e.g., a rack power cycling), the recovery traffic for catching up slightly stale containers can bottleneck the network and delay the time-to-healthy for the cluster.
+
+Since Ozone containers are generally **Append-Only** (chunks are immutable once written), the existing data on a stale replica is overwhelmingly likely to be valid and identical to the source's data up to that lower sequence ID.
+
+## 3. Incremental Replication Proposal
+
+The incremental replication mechanism allows the target Datanode to specify its current `BlockCommitSequenceId` when requesting a container download. The source Datanode will package and send only the blocks committed *after* that sequence ID, along with their associated chunks.

Review Comment:
   Is there any way to make this work for EC, which does not use block commit sequence ID?

-- 
This is an automated message from the Apache Git Service.
