This is an automated email from the ASF dual-hosted git repository.
weichiu pushed a commit to branch HDDS-9225-website-v2
in repository https://gitbox.apache.org/repos/asf/ozone-site.git
The following commit(s) were added to refs/heads/HDDS-9225-website-v2 by this
push:
new 5309c786d HDDS-14488. [Website v2] [Docs] [System Internals] Datanode
container replication (#279)
5309c786d is described below
commit 5309c786db61d29a5ddf61e4c49c00e329879185
Author: Gargi Jaiswal <[email protected]>
AuthorDate: Mon Jan 26 22:43:23 2026 +0530
HDDS-14488. [Website v2] [Docs] [System Internals] Datanode container
replication (#279)
---
cspell.yaml | 1 +
.../02-data/02-containers/08-replication.md | 154 ++++++++++++++++++++-
2 files changed, 153 insertions(+), 2 deletions(-)
diff --git a/cspell.yaml b/cspell.yaml
index 5aa1022f9..77ad7f9dd 100644
--- a/cspell.yaml
+++ b/cspell.yaml
@@ -183,6 +183,7 @@ words:
- ISA
- doublebuffer
- untars
+- Untar
- jdbc
- JDBC
- MIS
diff --git
a/docs/07-system-internals/04-replication/02-data/02-containers/08-replication.md
b/docs/07-system-internals/04-replication/02-data/02-containers/08-replication.md
index 8e4e3ddf4..465400a1a 100644
---
a/docs/07-system-internals/04-replication/02-data/02-containers/08-replication.md
+++
b/docs/07-system-internals/04-replication/02-data/02-containers/08-replication.md
@@ -4,6 +4,156 @@ sidebar_label: Replication
# Container Replication
-**TODO:** File a subtask under
[HDDS-9862](https://issues.apache.org/jira/browse/HDDS-9862) and complete this
page or section.
+## Overview
-Document replication process among Datanodes. Advantages of push replication
and cases where an EC container can be replicated, like decommissioning.
+Container replication is a critical mechanism in Apache Ozone that ensures data availability and durability by copying containers from source Datanodes to destination Datanodes. This page walks through the replication process step by step and describes the advantages of push replication and the scenarios in which EC (Erasure Coded) containers are replicated.
+
+## Replication Mode
+
+Apache Ozone supports **Push Replication** by **default**, where the source
Datanode actively pushes the container to the target Datanode. The replication
mode is controlled by the configuration property `hdds.scm.replication.push`
(default: `true`). When set to `false`, the system uses pull replication where
the target Datanode pulls from source Datanodes.
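+
+As a minimal sketch (not the actual Ozone code), the mode selection amounts to a boolean read from the Hadoop-style configuration that Ozone builds on; the class and method names below are hypothetical, while the property name and default come from this page:
+
+```java
+// Minimal sketch: read the replication mode from a Hadoop-style Configuration.
+// The property name and default come from this page; the class and method
+// names are hypothetical.
+import org.apache.hadoop.conf.Configuration;
+
+public final class ReplicationModeCheck {
+  /** Returns true when push replication should be used (the default). */
+  public static boolean usePushReplication(Configuration conf) {
+    return conf.getBoolean("hdds.scm.replication.push", true);
+  }
+}
+```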
+
+**Push Replication:** The `PushReplicator` class handles push replication by the following steps, sketched in code after the list:
+
+- Using `OnDemandContainerReplicationSource` to prepare the container
+- Using `GrpcContainerUploader` to upload the container via gRPC stream
+- Streaming the container data directly to the target Datanode
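+
+Conceptually, the push path wires a replication source to an uploader and streams the packed container to the target. The interfaces below are an illustrative sketch, not the real `PushReplicator`, `OnDemandContainerReplicationSource`, or `GrpcContainerUploader` APIs:
+
+```java
+// Illustrative sketch of the push flow; all types here are hypothetical
+// stand-ins for the source and uploader roles described above.
+import java.io.IOException;
+import java.io.OutputStream;
+
+interface ReplicationSource {
+  // Packs the container (tarball) and writes it to the given stream.
+  void copyData(long containerId, OutputStream out) throws IOException;
+}
+
+interface ContainerUploader {
+  // Opens a gRPC upload stream to the target Datanode.
+  OutputStream startUpload(long containerId, String targetDatanode)
+      throws IOException;
+}
+
+final class PushFlowSketch {
+  static void push(ReplicationSource source, ContainerUploader uploader,
+      long containerId, String target) throws IOException {
+    // The source streams the packed container directly to the target.
+    try (OutputStream out = uploader.startUpload(containerId, target)) {
+      source.copyData(containerId, out);
+    }
+  }
+}
+```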
+
+:::note
+
+Both regular container replication and EC container replication respect the
same `hdds.scm.replication.push` configuration setting. EC container
replication scenarios (decommissioning, under-replication, maintenance mode,
mis-replication) will use push mode when the configuration is `true` (default)
or pull mode when set to `false`.
+
+:::
+
+---
+
+## Detailed Replication Process
+
+The container replication process involves several well-defined steps.
+
+### Step 1: Source Datanode Prepares Container Tarball
+
+The source Datanode creates a tarball containing:
+
+- Container descriptor file (`container.yaml`) with metadata
+- RocksDB metadata files (database files)
+- Container chunk files (actual data)
+- Container checksum file (if present)
+
+**Compression:** This tarball is not compressed by default. Optional compression can be enabled via `hdds.container.replication.compression` (values: `NO_COMPRESSION` (default), `GZIP`, `SNAPPY`, `LZ4`, `ZSTD`).
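+
+As a rough illustration of the packaging step (not the actual Ozone packer), the sketch below tars a container directory with Apache Commons Compress; the paths and entry names are hypothetical, and no compression is applied, matching the `NO_COMPRESSION` default:
+
+```java
+// Minimal sketch of packing a container directory into an uncompressed tarball.
+// Paths and entry names are illustrative; this is not the real Ozone packer.
+import java.io.IOException;
+import java.io.OutputStream;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
+import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
+
+final class ContainerPackSketch {
+  static void pack(Path containerDir, OutputStream out) throws IOException {
+    try (TarArchiveOutputStream tar = new TarArchiveOutputStream(out)) {
+      tar.setLongFileMode(TarArchiveOutputStream.LONGFILE_POSIX);
+      List<Path> files;
+      try (Stream<Path> walk = Files.walk(containerDir)) {
+        files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
+      }
+      for (Path file : files) {
+        // e.g. container.yaml, RocksDB files, chunk files, checksum file
+        String entryName = containerDir.relativize(file).toString();
+        tar.putArchiveEntry(new TarArchiveEntry(file.toFile(), entryName));
+        Files.copy(file, tar);
+        tar.closeArchiveEntry();
+      }
+      tar.finish();
+    }
+  }
+}
+```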
+
+### Step 2: Destination Datanode Receives Tarball
+
+The source Datanode streams the tarball to the destination via gRPC:
+
+- Establishes gRPC stream connection
+- Streams data in chunks via `SendContainerRequest` messages
+- Destination writes chunks to a temporary file in the temp directory of the chosen volume: `<volume-root>/tmp/container-copy/`.
+  This lets the Datanode parallelize container downloads across volumes and avoids blocking on the system root drive.
+
+Before receiving, the destination selects a volume, **reserves space (2x
container size)** to accommodate both the tarball and the extracted files, and
creates the temporary directory.
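+
+A minimal sketch of the receive side, assuming hypothetical helper names: it checks that roughly twice the container size is free on the chosen volume, creates the temp directory, and appends incoming chunks to the temporary tarball:
+
+```java
+// Minimal sketch of the receive path; the 2x sizing rule and the temp
+// directory layout follow the description above, the helpers are hypothetical.
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.StandardOpenOption;
+
+final class ReceiveSketch {
+  static Path prepareDownload(Path volumeRoot, long containerId,
+      long containerSize) throws IOException {
+    long required = 2 * containerSize;  // tarball + extracted files
+    if (volumeRoot.toFile().getUsableSpace() < required) {
+      throw new IOException("Not enough space on volume " + volumeRoot);
+    }
+    Path tmpDir = volumeRoot.resolve("tmp").resolve("container-copy");
+    Files.createDirectories(tmpDir);
+    return tmpDir.resolve("container-" + containerId + ".tar");
+  }
+
+  // Called for each chunk received from the gRPC stream.
+  static void appendChunk(Path tarball, byte[] chunk) throws IOException {
+    Files.write(tarball, chunk,
+        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
+  }
+}
+```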
+
+### Step 3: Untar and Store Container Files
+
+The destination Datanode extracts the tarball:
+
+- Reads and verifies container descriptor (`container.yaml`)
+- Extracts RocksDB metadata to metadata directory
+- Extracts chunk files to chunks directory
+- Extracts checksum file (if present) to metadata directory
+
+Files are extracted to a temporary directory first, then moved atomically to
final locations: `<volume-root>/containerID/`,
`<volume-root>/containerID/metadata/`, and `<volume-root>/containerID/chunks/`.
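+
+The "extract to a temporary directory, then move atomically" pattern can be illustrated with plain `java.nio.file` calls; the helper below is only a sketch with a simplified directory layout:
+
+```java
+// Sketch of promoting extracted files from the temporary directory to the
+// final location with an atomic move (simplified layout, not the Ozone code).
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.StandardCopyOption;
+
+final class AtomicPromoteSketch {
+  static void promote(Path extractedTmpDir, Path volumeRoot, long containerId)
+      throws IOException {
+    // The real layout places metadata/ and chunks/ under the container directory.
+    Path finalDir = volumeRoot.resolve(Long.toString(containerId));
+    // Atomic on the same volume, so a crash never leaves a half-moved
+    // container visible under <volume-root>/containerID/.
+    Files.move(extractedTmpDir, finalDir, StandardCopyOption.ATOMIC_MOVE);
+  }
+}
+```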
+
+### Step 4: Import Container
+
+The container is imported into the Datanode's container set:
+
+- Creates container object from metadata
+- Sets container state to `RECOVERING`
+- Associates container with selected volume
+- Adds container to `ContainerSet`
+- Updates volume usage with container size
+- Schedules on-demand scan
+
+Import progress is tracked to prevent concurrent imports of the same container. If the import fails or the container already exists on this Datanode, the import is aborted and the temporary files are cleaned up (see Step 5).
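+
+The import step can be pictured as building the container object from the extracted metadata and registering it with the container set; every type in the sketch below is a hypothetical stand-in for the Datanode-side classes:
+
+```java
+// Illustrative import sketch; ContainerSet, Volume and Container are
+// hypothetical stand-ins, and error handling is reduced to the throws clause.
+import java.io.IOException;
+
+final class ImportSketch {
+  interface Container {
+    void setState(String state);
+  }
+  interface ContainerSet {
+    void add(Container container) throws IOException;  // rejects duplicates
+  }
+  interface Volume {
+    void incrementUsedSpace(long bytes);
+  }
+
+  static void importContainer(Container container, long sizeBytes,
+      ContainerSet containers, Volume volume) throws IOException {
+    container.setState("RECOVERING");   // not yet readable by clients
+    containers.add(container);          // fails if the container already exists
+    volume.incrementUsedSpace(sizeBytes);
+    // an on-demand scan of the new container is scheduled after this point
+  }
+}
+```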
+
+### Step 5: Delete Temporary Files
+
+After successful import, all temporary files are cleaned up:
+
+1. **Delete Tarball**: The downloaded tarball file is deleted from the
temporary directory
+2. **Release Reserved Space**: The reserved space on the volume is released
+3. **Cleanup on Failure**: If any step fails, temporary files are deleted and
reserved space is released
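+
+The cleanup contract above boils down to a `finally` block: whether the import succeeds or fails, the tarball is removed and the reserved space is released. The helpers in the sketch below are hypothetical:
+
+```java
+// Sketch of the cleanup contract; SpaceReservation and doImport are
+// hypothetical placeholders for the Datanode-side logic.
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+
+final class CleanupSketch {
+  interface SpaceReservation {
+    void release();
+  }
+
+  static void importWithCleanup(Path tarball, SpaceReservation reservation,
+      Runnable doImport) throws IOException {
+    try {
+      doImport.run();                  // Steps 3 and 4 above
+    } finally {
+      Files.deleteIfExists(tarball);   // always drop the downloaded tarball
+      reservation.release();           // give back the 2x reserved space
+    }
+  }
+}
+```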
+
+---
+
+## Advantages of Push Replication
+
+Push replication offers several advantages:
+
+- **Better Load Distribution**: Source Datanode controls transfer rate and
timing, managing its own load while simplifying target operations
+- **Improved Network Efficiency**: Direct streaming with gRPC flow control
adapts to network conditions, reducing latency
+- **Simplified Failure Handling**: Source handles retries and error recovery,
while targets only process incoming streams
+- **Better Resource Management**: Source can throttle transfers based on its
own capacity (disk I/O, CPU, network), preventing overload
+
+---
+
+## EC Container Replication Scenarios
+
+Erasure Coded (EC) containers have specific replication requirements and
scenarios where replication is necessary.
+
+### Scenario 1: Decommissioning
+
+**When it occurs:**
+When a Datanode enters the decommissioning state, all EC container replicas
stored on that Datanode must be replicated to other Datanodes before the
Datanode can be safely removed from the cluster.
+
+**How it works:**
+
+1. **Detection**: `ECUnderReplicationHandler` detects containers with replicas
on decommissioning Datanodes
+2. **Index Identification**: The handler identifies which EC indexes are only
present on decommissioning Datanodes (`decommissioningOnlyIndexes()`)
+3. **One-to-One Replication**: For each decommissioning index, a replication
command is created to copy that specific index to a new Datanode
+4. **Target Selection**: New target Datanodes are selected based on placement
policies
+5. **Replication Execution**: Each index is replicated independently using the
configured replication mode (push by default, configurable to pull via
`hdds.scm.replication.push`)
+
+**Example:**
+
+For an EC container with replication config `RS-6-3-1024k`:
+
+- 6 data blocks + 3 parity blocks = 9 total indexes
+- If index 2 is only on a decommissioning Datanode, only index 2 needs to be
replicated
+- The replication command includes `replicaIndex=2` to specify which index to
copy
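+
+For the `RS-6-3-1024k` example above, identifying the indexes that must be copied is essentially a set difference between the indexes held on decommissioning Datanodes and the indexes also held on healthy ones. The sketch below is illustrative and not the actual `ECUnderReplicationHandler` logic:
+
+```java
+// Illustrative computation of the "decommissioning-only" EC indexes.
+// The Replica record and its fields are hypothetical.
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+final class DecommissionIndexSketch {
+  record Replica(int replicaIndex, boolean onDecommissioningNode) {}
+
+  // Indexes that exist only on decommissioning Datanodes must be re-replicated.
+  static Set<Integer> decommissioningOnlyIndexes(List<Replica> replicas) {
+    Set<Integer> onDecommissioning = new HashSet<>();
+    Set<Integer> onHealthy = new HashSet<>();
+    for (Replica replica : replicas) {
+      (replica.onDecommissioningNode() ? onDecommissioning : onHealthy)
+          .add(replica.replicaIndex());
+    }
+    onDecommissioning.removeAll(onHealthy);
+    return onDecommissioning;   // e.g. {2} in the example above
+  }
+}
+```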
+
+### Scenario 2: Under-Replication
+
+**When it occurs:**
+When an EC container has fewer replicas than required for a specific index,
that index needs to be replicated.
+
+**How it works:**
+
+1. **Replica Count Analysis**: `ECContainerReplicaCount` analyzes the current
replica distribution
+2. **Missing Index Detection**: Identifies indexes that have fewer replicas
than required
+3. **Source Selection**: Selects healthy source replicas for the missing
indexes
+4. **Replication Commands**: Creates replication commands to restore
redundancy using the configured replication mode (push by default)
+
+### Scenario 3: Maintenance Mode
+
+**When it occurs:**
+When Datanodes enter maintenance mode, EC container replicas on those
Datanodes may need additional copies to maintain redundancy during maintenance.
+
+**How it works:**
+
+Similar to decommissioning, but with different redundancy requirements:
+
+- Maintenance mode allows for reduced redundancy
(`maintenanceRemainingRedundancy` config)
+- Replication ensures minimum redundancy is maintained during maintenance
using the configured replication mode (push by default)
+
+### Scenario 4: Mis-Replication
+
+**When it occurs:**
+When EC container replicas are placed on Datanodes that violate placement
policies (e.g., too many replicas in the same rack).
+
+**How it works:**
+
+1. **Placement Validation**: `ECMisReplicationHandler` validates replica
placement against policies
+2. **Violation Detection**: Identifies replicas that violate placement
constraints
+3. **Replication to Correct Placement**: Creates replication commands to move
replicas to compliant locations using the configured replication mode (push by
default)
+4. **Old Replica Deletion**: After successful replication, old replicas are
deleted
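+
+As a simple illustration of the placement check (not the actual `ECMisReplicationHandler` logic), counting replicas per rack against an allowed maximum is enough to flag the "too many replicas in the same rack" case:
+
+```java
+// Illustrative rack-placement check; the replica-to-rack mapping and the
+// maxPerRack threshold are hypothetical inputs.
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+final class PlacementCheckSketch {
+  // Returns the racks that hold more replicas than the policy allows.
+  static List<String> overloadedRacks(Map<Integer, String> replicaIndexToRack,
+      int maxPerRack) {
+    Map<String, Integer> countPerRack = new HashMap<>();
+    for (String rack : replicaIndexToRack.values()) {
+      countPerRack.merge(rack, 1, Integer::sum);
+    }
+    List<String> overloaded = new ArrayList<>();
+    countPerRack.forEach((rack, count) -> {
+      if (count > maxPerRack) {
+        overloaded.add(rack);
+      }
+    });
+    return overloaded;
+  }
+}
+```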
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]