This is an automated email from the ASF dual-hosted git repository.
sarvekshayr pushed a commit to branch HDDS-9225-website-v2
in repository https://gitbox.apache.org/repos/asf/ozone-site.git
The following commit(s) were added to refs/heads/HDDS-9225-website-v2 by this push:
new b483f36a8 HDDS-14264. [Website v2] [Docs] [Core Concepts] Erasure Coding (#183)
b483f36a8 is described below
commit b483f36a82a546b82a10383cf753b35a9cd049e9
Author: KUAN-HAO HUANG <[email protected]>
AuthorDate: Wed Dec 31 18:22:21 2025 +0800
HDDS-14264. [Website v2] [Docs] [Core Concepts] Erasure Coding (#183)
---
cspell.yaml | 2 +
.../02-replication/04-erasure-coding.md | 119 ++++++++++++++++++++-
.../02-replication/EC-Chunk-Layout.png | Bin 0 -> 179275 bytes
.../02-replication/EC-Reads-With-No-Failures.png | Bin 0 -> 194800 bytes
.../02-replication/EC-Reconstructional-Read.png | Bin 0 -> 221681 bytes
.../EC-Write-Block-Allocation-in-Containers.png | Bin 0 -> 194715 bytes
6 files changed, 120 insertions(+), 1 deletion(-)
diff --git a/cspell.yaml b/cspell.yaml
index 1e459cebe..47b35a13f 100644
--- a/cspell.yaml
+++ b/cspell.yaml
@@ -95,6 +95,8 @@ words:
- HDFS
- Protolock
- CSI
+- Quantcast
+- Xceiver
- matomo
- qube
- proto
diff --git a/docs/03-core-concepts/02-replication/04-erasure-coding.md b/docs/03-core-concepts/02-replication/04-erasure-coding.md
index 4947b7383..eadbc1927 100644
--- a/docs/03-core-concepts/02-replication/04-erasure-coding.md
+++ b/docs/03-core-concepts/02-replication/04-erasure-coding.md
@@ -1,3 +1,120 @@
# Erasure Coding
-**TODO:** File a subtask under [HDDS-9857](https://issues.apache.org/jira/browse/HDDS-9857) and complete this page or section.
+## Background
+
+A basic expectation of any distributed storage system is data durability.
+To provide high data durability, many popular storage systems use replication, which is expensive.
+Apache Ozone's default replication scheme, `RATIS/THREE`, has a 200% overhead in storage
+space and other resources (e.g., network bandwidth).
+However, for warm and cold datasets with relatively low I/O activity, the additional
+block replicas are rarely accessed during normal operations, yet they consume the same
+amount of resources as the first replica.
+
+Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication,
+which provides the same level of fault tolerance with much less storage space.
+In typical EC setups, the storage overhead is no more than 50%.
+A replication factor is not meaningful for an EC key; instead, Ozone uses the `ReplicationConfig`
+interface to specify the required type of replication, either `RATIS/THREE` or `EC`.
+
+Integrating EC with Ozone improves storage efficiency while still providing data durability
+similar to traditional replication-based Ozone deployments.
+For example, a 3x replicated file with 6 blocks consumes 6 * 3 = `18` blocks of disk space,
+whereas with an EC (6 data, 3 parity) deployment it consumes only 6 + 3 = `9` blocks of disk space.
+
+## Architecture
+
+The storage data layout is a key factor in the implementation of EC. After deep analysis
+and several technical considerations, the most fitting data layout is the striping model.
+Data striping is not new: it has already been adopted successfully by several other
+file systems (e.g., the Quantcast File System and the Hadoop Distributed File System).
+
+For example, with the EC (6 data, 3 parity) scheme, the client distributes the data chunks
+to the first 6 data nodes in order, then generates the 3 parity chunks and transfers them
+to the remaining 3 nodes in order.
+These 9 chunks together are called a "Stripe". The next 6 chunks are distributed to the same
+first 6 data nodes again, and their parity to the remaining 3 nodes. The blocks stored together
+on these 9 data nodes are called a "BlockGroup".
+
+If the application continues to write beyond the size of `6 * BLOCK_SIZE`, the client requests
+a new block group from the Ozone Manager.
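+
+To make the striping layout concrete, the following is a minimal, self-contained sketch (illustrative only, not Ozone source code) of how consecutive data chunks and their parity map onto the `d + p` nodes of a block group:
+
+```java
+// Illustrative sketch of round-robin striping: data chunk i of a stripe goes to
+// data node (i % d); once a stripe is full, its p parity chunks go to nodes d .. d+p-1.
+public class StripeLayoutSketch {
+    public static void main(String[] args) {
+        int d = 6, p = 3;              // EC(6, 3)
+        int dataChunks = 12;           // two full stripes, for brevity
+
+        for (int chunk = 0; chunk < dataChunks; chunk++) {
+            int stripe = chunk / d;    // which stripe this data chunk belongs to
+            int node = chunk % d;      // data nodes are indexed 0 .. d-1
+            System.out.printf("data chunk %d -> stripe %d, node %d%n", chunk, stripe, node);
+
+            if (node == d - 1) {       // stripe complete: place its parity chunks
+                for (int j = 0; j < p; j++) {
+                    System.out.printf("parity chunk %d of stripe %d -> node %d%n", j, stripe, d + j);
+                }
+            }
+        }
+    }
+}
+```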
+
+### Erasure Coding Write
+
+The core logic of erasure coding writes is placed in the Ozone client.
+When the client creates a file, the Ozone Manager allocates a block group (`d + p` nodes)
+from the pipeline provider and returns it to the client.
+As data arrives from the application, the client writes the first `d` chunks
+to the `d` data nodes of the block group. It also caches these `d` chunks
+to generate the parity chunks; once generated, they are transferred to the
+remaining `p` nodes in order. Once all blocks reach their configured sizes,
+the client requests a new block group of nodes.
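+
+The write flow described above can be sketched as follows. This is a standalone illustration rather than Ozone client code; it uses the simple `XOR-2-1` scheme so the parity computation stays short, but the buffer-then-encode-then-send pattern is the same for the RS schemes:
+
+```java
+// Illustrative write-path sketch (not Ozone source code): buffer d chunks,
+// compute parity over the cached chunks, then ship the parity to the parity node(s).
+import java.util.ArrayList;
+import java.util.List;
+
+public class EcWriteSketch {
+    static final int D = 2, CHUNK = 4;          // XOR-2-1 with tiny 4-byte chunks
+
+    public static void main(String[] args) {
+        byte[] data = "a tiny example payload".getBytes();
+        List<byte[]> stripe = new ArrayList<>();
+
+        for (int off = 0; off < data.length; off += CHUNK) {
+            byte[] chunk = new byte[CHUNK];
+            System.arraycopy(data, off, chunk, 0, Math.min(CHUNK, data.length - off));
+            stripe.add(chunk);          // write this chunk to data node (stripe.size() - 1) and cache it
+
+            if (stripe.size() == D) {   // stripe full: compute parity from the cached chunks
+                byte[] parity = new byte[CHUNK];
+                for (byte[] c : stripe)
+                    for (int i = 0; i < CHUNK; i++) parity[i] ^= c[i];
+                System.out.println("stripe complete -> send parity chunk to the parity node");
+                stripe.clear();
+            }
+        }
+        if (!stripe.isEmpty())
+            System.out.println("partial final stripe with " + stripe.size() + " cached chunk(s)");
+    }
+}
+```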
+
+The diagram below depicts the block allocation in containers as logical groups.
+In the interest of space, an EC(3, 2) Replication Config is assumed for the diagram.
+
+
+
+Let's zoom in on the data layout of blockID 1 from the picture above, shown in the following picture.
+It illustrates how the chunks are laid out in the data node blocks.
+
+
+
+Currently, the EC client reuses the existing data transfer endpoints to transfer data to the data nodes.
+The `XceiverClientGRPC` client is used for writing data and putBlock information.
+The Datanode-side changes are minimal because the same existing transfer protocols are reused;
+an EC data block written at a Datanode is the same as any other block in non-EC mode.
+Within a single block group, the container ID numbers are the same on all nodes. A file
+can have multiple block groups, and each block group has `d + p` blocks that all share the same ID.
+
+**d** - Number of data blocks in a block group
+
+**p** - Number of parity blocks in a block group
+
+### Erasure Coding Read
+
+For reads, the OM provides the node location details as part of the key lookup.
+If the key is erasure coded, the Ozone client reads it in EC fashion. Since the data layout
+is different (see the previous section on the write path), reads must take the layout into account.
+
+The EC client opens connections to the Datanodes based on the expected locations. When all data
+locations are available, it performs plain reads chunk by chunk, in round-robin fashion, from the
+`d` data blocks.
+
+The picture below shows the read order when there are no failures while reading.
+
+
+
+Until the client encounters a read failure, there is no need for EC reconstruction.
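+
+The round-robin order can also be shown with a tiny standalone sketch (not Ozone code); with no failures, only the `d` data nodes of the block group are contacted:
+
+```java
+// Illustrative sketch: with no failures, chunk i of the key is read from data node
+// (i % d) of the block group; the parity nodes are not contacted at all.
+public class EcPlainReadSketch {
+    public static void main(String[] args) {
+        int d = 3;   // EC(3, 2): the 2 parity nodes are only needed on failure
+        for (int chunk = 0; chunk < 7; chunk++) {
+            System.out.printf("read chunk %d from data node %d%n", chunk, chunk % d);
+        }
+    }
+}
+```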
+
+#### Erasure Coding On-the-fly Reconstruction Reads
+
+When the client detects failures while reading, or when starting the reads,
+the Ozone EC client can reconstruct/recover the lost data by performing EC decoding.
+Decoding requires reading parity replicas, so this is a degraded read.
+The reconstruction is completely transparent to applications.
+
+The picture below depicts how parity replicas are used during reconstruction.
+
+
+
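+As a toy illustration of on-the-fly reconstruction (again not Ozone code), the `XOR-2-1` scheme makes the idea easy to see: a lost data chunk is recovered by combining the surviving data chunk with the parity chunk. The RS codecs used by the default configs generalize the same principle:
+
+```java
+// Toy degraded-read illustration using XOR-2-1: with one data chunk lost,
+// XOR-ing the surviving data chunk with the parity chunk reconstructs it.
+import java.util.Arrays;
+
+public class EcReconstructionSketch {
+    public static void main(String[] args) {
+        byte[] d0 = {1, 2, 3, 4};
+        byte[] d1 = {9, 8, 7, 6};
+
+        byte[] parity = new byte[4];
+        for (int i = 0; i < 4; i++) parity[i] = (byte) (d0[i] ^ d1[i]);
+
+        // Simulate losing d1: the client reads d0 and the parity chunk instead,
+        // and decodes the lost chunk on the fly, transparently to the application.
+        byte[] recovered = new byte[4];
+        for (int i = 0; i < 4; i++) recovered[i] = (byte) (d0[i] ^ parity[i]);
+
+        System.out.println(Arrays.equals(recovered, d1));  // true
+    }
+}
+```
+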
+### Erasure Coding Replication Config
+
+Apache Ozone is built with pure 'Object Storage' semantics. However, many big data
+ecosystem projects still use file system APIs. To give Ozone the best of both worlds,
+it provides both kinds of interfaces. In both cases, keys/files are written into buckets
+under the hood, so EC Replication Configs can be set at the bucket level.
+The EC Replication Config encapsulates how to encode/decode a file.
+
+Each EC Replication Config is defined by the following pieces of information:
+
+ 1. **data:** Number of data blocks in an EC block group.
+ 2. **parity:** Number of parity blocks in an EC block group.
+ 3. **ecChunkSize:** The size of a striping chunk. This determines the granularity of striped reads and writes.
+ 4. **codec:** The EC algorithm to use (e.g., `RS` (Reed-Solomon), `XOR`).
+
+To pass an EC Replication Config on the command line or in configuration files, use the following format:
+*codec*-*num data blocks*-*num parity blocks*-*EC chunk size*
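+
+For illustration only, the format can be read as four `-`-separated fields; the snippet below is a standalone sketch, not Ozone's own parser:
+
+```java
+// Standalone sketch of reading the "<codec>-<data>-<parity>-<EC chunk size>" format.
+public class EcConfigFormatSketch {
+    public static void main(String[] args) {
+        String value = "RS-6-3-1024k";
+        String[] parts = value.split("-");
+
+        String codec = parts[0];                 // "RS"
+        int data = Integer.parseInt(parts[1]);   // 6 data blocks per block group
+        int parity = Integer.parseInt(parts[2]); // 3 parity blocks per block group
+        String ecChunkSize = parts[3];           // "1024k" striping chunk size
+
+        System.out.printf("codec=%s data=%d parity=%d ecChunkSize=%s%n",
+            codec, data, parity, ecChunkSize);
+    }
+}
+```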
+
+Currently, three built-in EC Replication Configs are supported: `RS-3-2-1024k`, `RS-6-3-1024k`, and `XOR-2-1-1024k`.
+The recommended option is `RS-6-3-1024k`. When a key/file is created without specifying a Replication Config,
+it inherits the EC Replication Config of its bucket, if one is set.
+
+Changing the bucket-level EC Replication Config only affects new files created within the bucket.
+Once a file has been created, its EC Replication Config currently cannot be changed.
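+
+The inheritance rule described above can be summarized with a small, purely hypothetical sketch; the helper below and the cluster-wide fallback are assumptions for illustration, not Ozone APIs:
+
+```java
+// Hypothetical sketch of the fallback order when a key is created:
+// explicit config from the client, else the bucket default, else a cluster default
+// (the cluster-wide fallback is an assumption for this illustration).
+public class ReplicationConfigInheritanceSketch {
+    static String resolveReplicationConfig(String requestedByClient,
+                                           String bucketDefault,
+                                           String clusterDefault) {
+        if (requestedByClient != null) return requestedByClient;
+        if (bucketDefault != null) return bucketDefault;   // e.g. "RS-6-3-1024k"
+        return clusterDefault;                              // e.g. "RATIS/THREE"
+    }
+
+    public static void main(String[] args) {
+        System.out.println(resolveReplicationConfig(null, "RS-6-3-1024k", "RATIS/THREE"));
+        System.out.println(resolveReplicationConfig(null, null, "RATIS/THREE"));
+    }
+}
+```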
diff --git a/docs/03-core-concepts/02-replication/EC-Chunk-Layout.png
b/docs/03-core-concepts/02-replication/EC-Chunk-Layout.png
new file mode 100644
index 000000000..d7b678526
Binary files /dev/null and
b/docs/03-core-concepts/02-replication/EC-Chunk-Layout.png differ
diff --git a/docs/03-core-concepts/02-replication/EC-Reads-With-No-Failures.png
b/docs/03-core-concepts/02-replication/EC-Reads-With-No-Failures.png
new file mode 100644
index 000000000..7dd4cab6e
Binary files /dev/null and
b/docs/03-core-concepts/02-replication/EC-Reads-With-No-Failures.png differ
diff --git a/docs/03-core-concepts/02-replication/EC-Reconstructional-Read.png
b/docs/03-core-concepts/02-replication/EC-Reconstructional-Read.png
new file mode 100644
index 000000000..427e73a2a
Binary files /dev/null and
b/docs/03-core-concepts/02-replication/EC-Reconstructional-Read.png differ
diff --git
a/docs/03-core-concepts/02-replication/EC-Write-Block-Allocation-in-Containers.png
b/docs/03-core-concepts/02-replication/EC-Write-Block-Allocation-in-Containers.png
new file mode 100644
index 000000000..3945662c0
Binary files /dev/null and
b/docs/03-core-concepts/02-replication/EC-Write-Block-Allocation-in-Containers.png
differ
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]