HDFS-9088. Cleanup erasure coding documentation. Contributed by Andrew Wang.

Change-Id: Ic3ec1f29fef0e27c46fff66fd28a51f8c4c61e71


Project: http://git-wip-us.apache.org/repos/asf/hadoop/repo
Commit: http://git-wip-us.apache.org/repos/asf/hadoop/commit/e36129b6
Tree: http://git-wip-us.apache.org/repos/asf/hadoop/tree/e36129b6
Diff: http://git-wip-us.apache.org/repos/asf/hadoop/diff/e36129b6

Branch: refs/heads/HDFS-7240
Commit: e36129b61abd9edbdd77e053a5e2bfdad434d164
Parents: ced438a
Author: Zhe Zhang <zhezh...@cloudera.com>
Authored: Thu Sep 17 09:56:32 2015 -0700
Committer: Zhe Zhang <zhezh...@cloudera.com>
Committed: Thu Sep 17 09:56:32 2015 -0700

----------------------------------------------------------------------
 .../hadoop-hdfs/CHANGES-HDFS-EC-7285.txt        |   2 +
 .../src/site/markdown/HDFSErasureCoding.md      | 123 +++++++++----------
 2 files changed, 57 insertions(+), 68 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/hadoop/blob/e36129b6/hadoop-hdfs-project/hadoop-hdfs/CHANGES-HDFS-EC-7285.txt
----------------------------------------------------------------------
diff --git a/hadoop-hdfs-project/hadoop-hdfs/CHANGES-HDFS-EC-7285.txt 
b/hadoop-hdfs-project/hadoop-hdfs/CHANGES-HDFS-EC-7285.txt
index acf62cb..0345a54 100755
--- a/hadoop-hdfs-project/hadoop-hdfs/CHANGES-HDFS-EC-7285.txt
+++ b/hadoop-hdfs-project/hadoop-hdfs/CHANGES-HDFS-EC-7285.txt
@@ -427,3 +427,5 @@
 
     HDFS-8899. Erasure Coding: use threadpool for EC recovery tasks on 
DataNode.
     (Rakesh R via zhz)
+
+    HDFS-9088. Cleanup erasure coding documentation. (wang via zhz)

http://git-wip-us.apache.org/repos/asf/hadoop/blob/e36129b6/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSErasureCoding.md
----------------------------------------------------------------------
diff --git 
a/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSErasureCoding.md 
b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSErasureCoding.md
index 44c209e..3040bf5 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSErasureCoding.md
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSErasureCoding.md
@@ -19,108 +19,95 @@ HDFS Erasure Coding
     * [Purpose](#Purpose)
     * [Background](#Background)
     * [Architecture](#Architecture)
-    * [Hardware resources](#Hardware_resources)
     * [Deployment](#Deployment)
-        * [Configuration details](#Configuration_details)
-        * [Deployment details](#Deployment_details)
+        * [Cluster and hardware 
configuration](#Cluster_and_hardware_configuration)
+        * [Configuration keys](#Configuration_keys)
         * [Administrative commands](#Administrative_commands)
 
 Purpose
 -------
-  Replication is expensive -- the default 3x replication scheme has 200% 
overhead in storage space and other resources (e.g., network bandwidth).
-  However, for “warm” and “cold” datasets with relatively low I/O 
activities, secondary block replicas are rarely accessed during normal 
operations, but still consume the same amount of resources as the primary ones.
+  Replication is expensive -- the default 3x replication scheme in HDFS has 
200% overhead in storage space and other resources (e.g., network bandwidth).
+  However, for warm and cold datasets with relatively low I/O activities, 
additional block replicas are rarely accessed during normal operations, but 
still consume the same amount of resources as the first replica.
 
-  Therefore, a natural improvement is to use Erasure Coding (EC) in place of 
replication, which provides the same level of fault tolerance with much less 
storage space. In typical Erasure Coding(EC) setups, the storage overhead is 
≤ 50%.
+  Therefore, a natural improvement is to use Erasure Coding (EC) in place of 
replication, which provides the same level of fault-tolerance with much less 
storage space. In typical Erasure Coding (EC) setups, the storage overhead is 
no more than 50%.
 
 Background
 ----------
 
   In storage systems, the most notable usage of EC is Redundant Array of 
Inexpensive Disks (RAID). RAID implements EC through striping, which divides 
logically sequential data (such as a file) into smaller units (such as bit, 
byte, or block) and stores consecutive units on different disks. In the rest of 
this guide this unit of striping distribution is termed a striping cell (or 
cell). For each stripe of original data cells, a certain number of parity cells 
are calculated and stored -- the process of which is called encoding. The error 
on any striping cell can be recovered through decoding calculation based on 
surviving data and parity cells.
 
-  Integrating the EC function with HDFS could get storage efficient 
deployments. It can provide similar data tolerance as traditional HDFS 
replication based deployments but it stores only one original replica data and 
parity cells.
-  In a typical case, A file with 6 blocks will actually be consume space of 
6*3 = 18 blocks with replication factor 3. But with EC (6 data,3 parity) 
deployment, it will only consume space of 9 blocks.
+  Integrating EC with HDFS can improve storage efficiency while still 
providing similar data durability as traditional replication-based HDFS 
deployments.
+  As an example, a 3x replicated file with 6 blocks will consume 6*3 = 18 
blocks of disk space. But with EC (6 data, 3 parity) deployment, it will only 
consume 9 blocks of disk space.
 
 Architecture
 ------------
-  In the context of EC, striping has several critical advantages. First, it 
enables online EC which bypasses the conversion phase and immediately saves 
storage space. Online EC also enhances sequential I/O performance by leveraging 
multiple disk spindles in parallel; this is especially desirable in clusters 
with high end networking  . Second, it naturally distributes a small file to 
multiple DataNodes and eliminates the need to bundle multiple files into a 
single coding group. This greatly simplifies file operations such as deletion, 
quota reporting, and migration between federated namespaces.
+  In the context of EC, striping has several critical advantages. First, it 
enables online EC (writing data immediately in EC format), avoiding a 
conversion phase and immediately saving storage space. Online EC also enhances 
sequential I/O performance by leveraging multiple disk spindles in parallel; 
this is especially desirable in clusters with high end networking. Second, it 
naturally distributes a small file to multiple DataNodes and eliminates the 
need to bundle multiple files into a single coding group. This greatly 
simplifies file operations such as deletion, quota reporting, and migration 
between federated namespaces.
 
-  As in general HDFS clusters, small files could account for over 3/4 of total 
storage consumption. So, In this first phase of erasure coding work, HDFS 
supports striping model. In the near future, HDFS will supports contiguous 
layout as second second phase work. So this guide focuses more on striping 
model EC.
+  In typical HDFS clusters, small files can account for over 3/4 of total 
storage consumption. To better support small files, in this first phase of work 
HDFS supports EC with striping. In the future, HDFS will also support a 
contiguous EC layout. See the design doc and discussion on 
[HDFS-7285](https://issues.apache.org/jira/browse/HDFS-7285) for more 
information.
 
- *  **NameNode Extensions** - Under the striping layout, a HDFS file is 
logically composed of block groups, each of which contains a certain number of  
 internal blocks.
-   To eliminate the need for NameNode to monitor all internal blocks, a new 
hierarchical block naming protocol is introduced, where the ID of a block group 
can be inferred from any of its internal blocks. This allows each block group 
to be managed as a new type of BlockInfo named BlockInfoStriped, which tracks 
its own internal blocks by attaching an index to each replica location.
+ *  **NameNode Extensions** - Striped HDFS files are logically composed of 
block groups, each of which contains a certain number of internal blocks.
+    To reduce NameNode memory consumption from these additional blocks, a new 
hierarchical block naming protocol was introduced. The ID of a block group can 
be inferred from the ID of any of its internal blocks. This allows management 
at the level of the block group rather than the block.
 
- *  **Client Extensions** - The basic principle behind the extensions is to 
allow the client node to work on multiple internal blocks in a block group in
-    parallel.
+ *  **Client Extensions** - The client read and write paths were enhanced to 
work on multiple internal blocks in a block group in parallel.
     On the output / write path, DFSStripedOutputStream manages a set of data 
streamers, one for each DataNode storing an internal block in the current block 
group. The streamers mostly
     work asynchronously. A coordinator takes charge of operations on the 
entire block group, including ending the current block group, allocating a new 
block group, and so forth.
     On the input / read path, DFSStripedInputStream translates a requested 
logical byte range of data into ranges of internal blocks stored on DataNodes. 
It then issues read requests in
     parallel. Upon failures, it issues additional read requests for decoding.
 
- *  **DataNode Extensions** - ErasureCodingWorker(ECWorker) is for 
reconstructing erased erasure coding blocks and runs along with the Datanode 
process. Erased block details would have been found out by Namenode 
ReplicationMonitor thread and sent to Datanode via its heartbeat responses as 
discussed in the previous sections. For each reconstruction task,
-   i.e. ReconstructAndTransferBlock, it will start an internal daemon thread 
that performs 3 key tasks:
+ *  **DataNode Extensions** - The DataNode runs an additional 
ErasureCodingWorker (ECWorker) task for background recovery of failed erasure 
coded blocks. Failed EC blocks are detected by the NameNode, which then chooses 
a DataNode to do the recovery work. The recovery task is passed as a heartbeat 
response. This process is similar to how replicated blocks are re-replicated on 
failure. Reconstruction performs three key tasks:
 
-      _1.Read the data from source nodes:_ For reading the data blocks from 
different source nodes, it uses a dedicated thread pool.
-         The thread pool is initialized when ErasureCodingWorker initializes. 
Based on the EC policy, it schedules the read requests to all source targets 
and ensures only to read
-         minimum required input blocks for reconstruction.
+      1. _Read the data from source nodes:_ Input data is read in parallel 
from source nodes using a dedicated thread pool.
+        Based on the EC policy, it schedules the read requests to all source 
targets and reads only the minimum number of input blocks for reconstruction.
 
-      _2.Decode the data and generate the output data:_ Actual 
decoding/encoding is done by using RawErasureEncoder API currently.
-        All the erased data and/or parity blocks will be recovered together.
+      1. _Decode the data and generate the output data:_ New data and parity 
blocks are decoded from the input data. All missing data and parity blocks are 
decoded together.
 
-     _3.Transfer the generated data blocks to target nodes:_ Once decoding is 
finished, it will encapsulate the output data to packets and send them to
-        target Datanodes.
-   To accommodate heterogeneous workloads, we allow files and directories in 
an HDFS cluster to have different replication and EC policies.
-*   **ErasureCodingPolicy**
-    Information on how to encode/decode a file is encapsulated in an 
ErasureCodingPolicy class. Each policy is defined by the following 2 pieces of 
information:
-    _1.The ECScema: This includes the numbers of data and parity blocks in an 
EC group (e.g., 6+3), as well as the codec algorithm (e.g., Reed-Solomon).
+      1. _Transfer the generated data blocks to target nodes:_ Once decoding 
is finished, the recovered blocks are transferred to target DataNodes.
 
-    _2.The size of a striping cell.
+ *  **ErasureCoding policy**
+    To accommodate heterogeneous workloads, we allow files and directories in 
an HDFS cluster to have different replication and EC policies.
+    Information on how to encode/decode a file is encapsulated in an 
ErasureCodingPolicy class. Each policy is defined by the following 2 pieces of 
information:
 
-   Client and Datanode uses EC codec framework directly for doing the 
endoing/decoding work.
+      1. _The ECSchema:_ This includes the numbers of data and parity blocks 
in an EC group (e.g., 6+3), as well as the codec algorithm (e.g., Reed-Solomon).
 
- *  **Erasure Codec Framework**
-     We support a generic EC framework which allows system users to define, 
configure, and deploy multiple coding schemas such as conventional 
Reed-Solomon, HitchHicker and
-     so forth.
-     ErasureCoder is provided to encode or decode for a block group in the 
middle level, and RawErasureCoder is provided to perform the concrete algorithm 
calculation in the low level. ErasureCoder can
-     combine and make use of different RawErasureCoders for tradeoff. We 
abstracted coder type, data blocks size, parity blocks size into ECSchema. A 
default system schema using RS (6, 3) is built-in.
-     For the system default codec Reed-Solomon we implemented both 
RSRawErasureCoder in pure Java and NativeRawErasureCoder based on Intel ISA-L. 
Below is the performance
-     comparing for different coding chunk size. We can see that the native 
coder can outperform the Java coder by up to 35X.
+      1. _The size of a striping cell._ This determines the granularity of 
striped reads and writes, including buffer sizes and encoding work.
 
-     _Intel® Storage Acceleration-Library(Intel® ISA-L)_ ISA-L is an Open 
Source Version and is a collection of low-level functions used in storage 
applications.
-     The open source version contains fast erasure codes that implement a 
general Reed-Solomon type encoding for blocks of data that helps protect against
-     erasure of whole blocks. The general ISA-L library contains an expanded 
set of functions used for data protection, hashing, encryption, etc. By
-     leveraging instruction sets like SSE, AVX and AVX2, the erasure coding 
functions are much optimized and outperform greatly on IA platforms. ISA-L
-     supports Linux, Windows and other platforms as well. Additionally, it 
also supports incremental coding so applications don’t have to wait all source
-     blocks to be available before to perform the coding, which can be used in 
HDFS.
+    Currently, HDFS supports the Reed-Solomon and XOR erasure coding 
algorithms. Additional algorithms are planned as future work.
+    The system default scheme is Reed-Solomon (6, 3) with a cell size of 64KB.
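
  As a worked illustration (the byte counts follow directly from the numbers
  above), one full stripe under the default Reed-Solomon (6, 3) scheme with a
  64KB cell works out to:

      data cells       = 6 x 64KB = 384KB
      parity cells     = 3 x 64KB = 192KB
      storage overhead = 192KB / 384KB = 50%
      fault tolerance  = any 3 of the 9 cells in a stripe may be lost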
 
-Hardware resources
-------------------
-  For using EC feature, you need to prepare for the following.
-    Depending on the ECSchemas used, we need to have minimum number of 
Datanodes available in the cluster. Example if we use ReedSolomon(6, 3) 
ECSchema,
-    then minimum nodes required is 9 to succeed the write. It can tolerate up 
to 3 failures.
 
 Deployment
 ----------
 
-### Configuration details
+### Cluster and hardware configuration
+
+  Erasure coding places additional demands on the cluster in terms of CPU and 
network.
 
-  In the EC feature, raw coders are configurable. So, users need to decide the 
RawCoder algorithms.
-  Configure the customized algorithms with configuration key 
"*io.erasurecode.codecs*".
+  Encoding and decoding work consumes additional CPU on both HDFS clients and 
DataNodes.
 
-  Default Reed-Solomon based raw coders available in built, which can be 
configured by using the configuration key "*io.erasurecode.codec.rs.rawcoder*".
-  And also another default raw coder available if XOR based raw coder. Which 
could be configured by using "*io.erasurecode.codec.xor.rawcoder*"
+  Erasure coded files are also spread across racks for rack fault-tolerance.
+  This means that when reading and writing striped files, most operations are 
off-rack.
+  Network bisection bandwidth is thus very important.
 
-  _EarasureCodingWorker Confugurations:_
-    dfs.datanode.stripedread.threshold.millis - Threshold time for polling 
timeout for read service. Default value is 5000
-    dfs.datanode.stripedread.threads – Number striped read thread pool 
threads. Default value is 20
-    dfs.datanode.stripedread.buffer.size - Buffer size for reader service. 
Default value is 256 * 1024
+  For rack fault-tolerance, it is also important to have at least as many 
racks as the configured EC stripe width.
+  For the default EC policy of RS (6,3), this means minimally 9 racks, and 
ideally 10 or 11 to handle planned and unplanned outages.
+  For clusters with fewer racks than the stripe width, HDFS cannot maintain 
rack fault-tolerance, but will still attempt
+  to spread a striped file across multiple nodes to preserve node-level 
fault-tolerance.
 
-### Deployment details
+### Configuration keys
 
-  With the striping model, client machine is responsible for do the EC endoing 
and tranferring data to the datanodes.
-  So, EC with striping model expects client machines with hghg end 
configurations especially of CPU and network.
+  The codec implementation for Reed-Solomon and XOR can be configured with the 
following client and DataNode configuration keys:
+  `io.erasurecode.codec.rs.rawcoder` and `io.erasurecode.codec.xor.rawcoder`.
+  The default implementations for both of these codecs are pure Java.
+
+  Erasure coding background recovery work on the DataNodes can also be tuned 
via the following configuration parameters:
+
+  1. `dfs.datanode.stripedread.threshold.millis` - Timeout for striped reads. 
Default value is 5000 ms.
+  1. `dfs.datanode.stripedread.threads` - Number of concurrent reader threads. 
Default value is 20 threads.
+  1. `dfs.datanode.stripedread.buffer.size` - Buffer size for reader service. 
Default value is 256KB.
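
  For illustration, a minimal hdfs-site.xml sketch that spells out these same
  settings explicitly (the values shown are simply the defaults listed above;
  256KB = 262144 bytes):

      <property>
        <name>dfs.datanode.stripedread.threshold.millis</name>
        <value>5000</value>
      </property>
      <property>
        <name>dfs.datanode.stripedread.threads</name>
        <value>20</value>
      </property>
      <property>
        <name>dfs.datanode.stripedread.buffer.size</name>
        <value>262144</value>
      </property>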
 
 ### Administrative commands
- ErasureCoding command-line is provided to perform administrative commands 
related to ErasureCoding. This can be accessed by executing the following 
command.
+
+  HDFS provides an `erasurecode` subcommand to perform administrative commands 
related to erasure coding.
 
        hdfs erasurecode [generic options]
          [-setPolicy [-s <policyName>] <path>]
@@ -131,18 +118,18 @@ Deployment
 
 Below are the details about each command.
 
-*  **SetPolicy command**: `[-setPolicy [-s <policyName>] <path>]`
+ *  `[-setPolicy [-s <policyName>] <path>]`
 
-    SetPolicy command is used to set an ErasureCoding policy on a directory at 
the specified path.
+    Sets an ErasureCoding policy on a directory at the specified path.
 
-      `path`: Refer to a pre-created directory in HDFS. This is a mandatory 
parameter.
+      `path`: A directory in HDFS. This is a mandatory parameter. Setting a 
policy only affects newly created files, and does not affect existing files.
 
-      `policyName`: This is an optional parameter, specified using ‘-s’ 
flag. Refer to the name of ErasureCodingPolicy to be used for encoding files 
under this directory. If not specified the system default ErasureCodingPolicy 
will be used.
+      `policyName`: The ErasureCoding policy to be used for files under this 
directory. This is an optional parameter, specified using the ‘-s’ flag. If no 
policy is specified, the system default ErasureCodingPolicy will be used.
 
-*  **GetPolicy command**: `[-getPolicy <path>]`
+ *  `[-getPolicy <path>]`
 
-     GetPolicy command is used to get details of the ErasureCoding policy of a 
file or directory at the specified path.
+     Get details of the ErasureCoding policy of a file or directory at the 
specified path.
 
-*  **ListPolicies command**:  `[-listPolicies]`
+ *  `[-listPolicies]`
 
-     Lists all supported ErasureCoding policies. For setPolicy command, one of 
these policies' name should be provided.
\ No newline at end of file
+     Lists all supported ErasureCoding policies. These names are suitable for 
use with the `setPolicy` command.
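
  A minimal usage sketch (the directory path here is only an example; invoking
  `-setPolicy` without `-s` applies the system default policy):

       hdfs erasurecode -listPolicies
       hdfs erasurecode -setPolicy /data/ec-dir
       hdfs erasurecode -getPolicy /data/ec-dir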
