errose28 commented on code in PR #9913:
URL: https://github.com/apache/ozone/pull/9913#discussion_r2978134986


##########
hadoop-hdds/docs/content/design/incremental-container-replication.md:
##########
@@ -0,0 +1,129 @@
+---
+title: Incremental Container Replication
+summary: Allow Datanodes to catch up on missing container data via 
incremental replication from the replica with the higher sequence ID to the 
replica with the lower one.
+date: 2026-03-13
+jira: HDDS-14794
+status: draft
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# Incremental Container Replication
+
+## 1. Introduction
+
+Ozone currently handles container replication by transferring the entire 
container (RocksDB plus all chunk files) as a single compressed tarball. 
+While this approach guarantees that the target Datanode receives an exact copy 
of the source container, it is highly inefficient for scenarios where the 
target Datanode already possesses a slightly older version of the container 
(e.g., due to a temporary network partition, node reboot, or a Ratis follower 
falling behind).
+
+This document proposes an **Incremental Container Replication** mechanism. If 
two container replicas have different sequence IDs, we can set up 
incremental replication from the replica with the higher sequence ID to the 
replica with the lower one instead of doing full container replication. By 
leveraging the monotonically increasing `BlockCommitSequenceId` of containers, 
a Datanode with a stale container replica can request only the delta (new 
blocks, chunks, and tombstones) from a fully up-to-date source Datanode, rather 
than re-downloading the entire container.
+
+## 2. Background and Motivation
+
+When a container replica falls behind (its `BlockCommitSequenceId` is lower 
than the sequence ID on other replicas), SCM currently treats the stale 
replica as invalid and will typically schedule a full 
`ReplicateContainerCommand`.
+The target Datanode downloads the full container tarball, replacing its local 
copy entirely.
+
+This leads to several issues:
+1.  **Network Waste**: A 5 GB container might only be missing 5 MB of recently 
appended data. Transferring the full 5 GB moves roughly 1000x more data than 
needed.
+2.  **Disk I/O and Write Amplification**: Rewriting 5 GB of identical data 
wastes disk I/O, which is especially detrimental to SSD longevity and cluster 
performance.
+3.  **Recovery Time**: In large-scale cluster events (e.g., a rack power 
cycle), the recovery traffic for catching up slightly stale containers can 
bottleneck the network and delay the cluster's return to a healthy state.
+
+Since Ozone containers are generally **Append-Only** (chunks are immutable 
once written), the existing data on a stale replica is overwhelmingly likely to 
be valid and identical to the source's data up to that lower sequence ID.
+
+## 3. Incremental Replication Proposal
+
+The incremental replication mechanism allows the target Datanode to specify 
its current `BlockCommitSequenceId` when requesting a container download. The 
source Datanode will package and send only the blocks committed *after* that 
sequence ID, along with their associated chunks.

Review Comment:
   Is there any way to make this work for EC, which does not use block commit 
sequence ID?
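
For reference, the delta selection described in section 3 of the quoted document can be sketched roughly as follows. This is only an illustration under stated assumptions, not Ozone code: `IncrementalDeltaSketch`, `BlockInfo`, and `selectDelta` are hypothetical names, and the sketch assumes each block records the sequence ID at which it was committed. It covers only the sequence-ID-based case, so it does not address the EC question raised above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types exist in Ozone under these names.
class IncrementalDeltaSketch {

    /** Assumed minimal block metadata: a local ID and the sequence ID at commit time. */
    static final class BlockInfo {
        final long localId;
        final long commitSequenceId;

        BlockInfo(long localId, long commitSequenceId) {
            this.localId = localId;
            this.commitSequenceId = commitSequenceId;
        }
    }

    /**
     * Returns the blocks the source would ship to a stale target:
     * those committed strictly after the sequence ID reported by the target.
     */
    static List<BlockInfo> selectDelta(List<BlockInfo> sourceBlocks, long targetSequenceId) {
        List<BlockInfo> delta = new ArrayList<>();
        for (BlockInfo block : sourceBlocks) {
            if (block.commitSequenceId > targetSequenceId) {
                delta.add(block);
            }
        }
        return delta;
    }

    public static void main(String[] args) {
        List<BlockInfo> source = Arrays.asList(
            new BlockInfo(1L, 10L),
            new BlockInfo(2L, 20L),
            new BlockInfo(3L, 30L));
        // The target reports that it is up to date through sequence ID 10.
        for (BlockInfo block : selectDelta(source, 10L)) {
            System.out.println("send block " + block.localId
                + " (sequence ID " + block.commitSequenceId + ")");
        }
    }
}
```

A target that already matches the source's highest sequence ID would get an empty delta, which is the case where no transfer is needed at all.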



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

