This is an automated email from the ASF dual-hosted git repository.

kturner pushed a commit to branch master
in repository

The following commit(s) were added to refs/heads/master by this push:
     new 54c5b50  add a blog post about using HDFS erasure coding with Accumulo 
54c5b50 is described below

commit 54c5b5058860cc366711c9e521653c0fa2cb1837
Author: etseidl <>
AuthorDate: Wed Sep 18 14:41:49 2019 -0700

    add a blog post about using HDFS erasure coding with Accumulo (#194)
 _posts/blog/ | 181 +++++++++++++++++++++++++++++++
 images/blog/201909_ec/ec-latency-14.png  | Bin 0 -> 751396 bytes
 images/blog/201909_ec/ec-latency-14e.png | Bin 0 -> 843379 bytes
 images/blog/201909_ec/ec-latency-16.png  | Bin 0 -> 854230 bytes
 4 files changed, 181 insertions(+)

diff --git a/_posts/blog/ 
new file mode 100644
index 0000000..39083be
--- /dev/null
+++ b/_posts/blog/
@@ -0,0 +1,181 @@
+title: "Using HDFS Erasure Coding with Accumulo"
+author: Ed Seidl
+HDFS normally stores multiple copies of each file for both performance and durability reasons.
+The number of copies is controlled via HDFS replication settings, and by default is set to 3.
+Hadoop 3 introduced the use of erasure coding (EC), which improves durability while decreasing overhead.
+Since Accumulo 2.0 now supports Hadoop 3, it's time to take a look at whether
+EC with Accumulo makes sense.
+* [EC Intro](#ec-intro)
+* [EC Performance](#ec-performance)
+* [Accumulo Performance with EC](#accumulo-performance-with-ec)
+### EC Intro
+By default HDFS achieves durability via block replication.  Usually
+the replication count is 3, resulting in a storage overhead of 200%. Hadoop 3
+introduced EC as a better way to achieve durability.  More info can be
+found in [EC Introduction].
+EC behaves much like RAID 5 or 6...for *k* blocks of data, *m* blocks of
+parity data are generated, from which the original data can be recovered in the
+event of disk or node failures (erasures, in EC parlance).  A typical EC
+scheme is Reed-Solomon 6-3, where 6 data blocks produce 3 parity blocks,
+an overhead of only 50%.  In addition to doubling the available disk space,
+RS 6-3 is also more fault tolerant...a loss of any 3 blocks can be tolerated,
+whereas triple replication can only tolerate the loss of two.
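+The overhead arithmetic above can be checked with a quick sketch (the helper function is made up for illustration, not from the post):

```shell
# Storage overhead: parity/data for an RS(k,m) scheme, (copies - 1) for
# replication. The rs_overhead helper is hypothetical.
rs_overhead() { echo $(( 100 * $2 / $1 )); }  # args: k data blocks, m parity blocks

echo "RS 6-3 overhead:  $(rs_overhead 6 3)%"   # 50%
echo "RS 10-4 overhead: $(rs_overhead 10 4)%"  # 40%
echo "3x replication:   $(( (3 - 1) * 100 ))%" # 200%
```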
+More storage, better resiliency, so what's the catch?  One concern is
+the time spent calculating the parity blocks.  Unlike replication,
+where a client writes a block and the DataNodes then replicate
+the data, an EC HDFS client is responsible for computing the parity and
+sending it to the DataNodes.  This increases the CPU and network load on the
+client.  The hit can be mitigated by using Intel's ISA-L library, but only on CPUs
+that support AVX or AVX2 instructions.  (See [EC Myths] and [EC Introduction]
+for some interesting claims.) In addition, unlike the serial replication I/O path,
+the EC I/O path is parallel, providing greater throughput. In our testing, sequential writes to
+an EC directory were as much as 3 times faster than to a replicated directory,
+and reads were up to 2 times faster.
+Another side effect of EC is loss of data locality.  For performance reasons, 
+data blocks are striped, so multiple DataNodes must be contacted to read a 
+block of data.  For large sequential reads this is not a
+problem, but it can be an issue for small random lookups.  For the latter case,
+using RS 6-3 with 64KB stripes mitigates some of the random lookup pain
+without compromising sequential read/write performance.
+#### Important Warning
+Before continuing, an important caveat: the current implementation of EC on
+Hadoop supports neither hsync nor hflush.  Both of these operations are silent
+no-ops (see the EC [limitations]).  We discovered this the hard
+way when a data center power loss corrupted our write-ahead logs, which were
+stored in an EC directory.  To avoid this problem, ensure all
+WAL directories use replication.  It's probably a good idea to keep the
+accumulo namespace replicated as well, but we have no evidence to back up that
+assertion.  As with all things, don't test on production data.
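+As a sketch, the policy on the WAL directory can be inspected and pinned to replication with the Hadoop 3 `hdfs ec` subcommands (the path shown is illustrative; adjust it to your instance.volumes layout):

```shell
# Check the effective EC policy on the WAL directory (path is illustrative)
hdfs ec -getPolicy -path /accumulo/wal

# Force plain replication, overriding any EC policy inherited from a parent
hdfs ec -setPolicy -path /accumulo/wal -replicate
```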
+### EC Performance
+To test EC performance, we created a series of clusters on AWS.  Our Accumulo
+stack consisted of Hadoop 3.1.1 built with the Intel ISA-L library enabled,
+Zookeeper 3.4.13, and Accumulo 1.9.3 configured
+to work with Hadoop 3 (we did our testing before the official release of
+Accumulo 2.0). The encoding policy is set per-directory using the [hdfs]
+command-line tool. To set the encoding policy
+for an Accumulo table, first find the table ID (for instance using the
+Accumulo shell's "tables -l" command), and then from the command line set the
+policy for the corresponding directory
+under /accumulo/tables.  Note that changing the policy on a directory will set
+the policy for child directories, but will not change any files contained
+within.  To change the policy on an existing
+Accumulo table, you must first set the encoding policy, and then run a major
+compaction to rewrite the RFiles for the table.
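+Putting those steps together, the workflow might look like the following (the table name `mytable`, its ID `1a`, and the username are made up for illustration):

```shell
# 1) Look up the table ID in the Accumulo shell, e.g. mytable => 1a
accumulo shell -u root -e 'tables -l'

# 2) Enable the policy cluster-wide, then set it on the table's directory
hdfs ec -enablePolicy -policy RS-6-3-1024k
hdfs ec -setPolicy -path /accumulo/tables/1a -policy RS-6-3-1024k

# 3) Run a major compaction so existing RFiles are rewritten under the new policy
accumulo shell -u root -e 'compact -t mytable -w'
```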
+Our first tests were of sequential read and write performance straight to
+HDFS.  For this test we had a cluster of 32 HDFS nodes (c5.4xlarge [AWS]
+instances), 16 Spark nodes,
+3 zookeepers (r5.xlarge), and 1 master (r5.2xlarge).
+The first table below shows the results for writing a 1TB file.  The results
+are the average of three runs for each of the directory encodings:
+Reed-Solomon (RS) 6-3 with 64KB stripes, RS 6-3 with 1MB stripes,
+RS 10-4 with 1MB stripes, and the default triple replication.  We also varied
+the number of concurrent Spark executors, performing tests with 16 executors
+that did not stress the cluster in any area, and with
+128 executors, which exhausted our network bandwidth allotment of 5 Gbps. As
+can be seen, in the 16 executor environment we saw greater than a 3X bump in
+throughput using RS 10-4 with 1MB stripes over triple replication.
+At saturation, the speedup was still over 2X, which is in line with the
+results from [EC Myths]. Also of note, RS 6-3 with 64KB stripes performed
+better than the same encoding with 1MB stripes, which is a nice result for
+Accumulo, as we'll show later.
+|Encoding|16 executors|128 executors|
+|-------- | ----------: | -----------: |
+|Replication|2.19 GB/s|4.13 GB/s|
+|RS 6-3 64KB|6.33 GB/s|8.11 GB/s|
+|RS 6-3 1MB|6.22 GB/s|7.93 GB/s|
+|RS 10-4 1MB|7.09 GB/s|8.34 GB/s|
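+For reference, the throughput numbers above work out to the following speedups over replication at 16 executors:

```shell
# Speedup of each EC encoding over replication (2.19 GB/s) at 16 executors,
# computed from the write-throughput table above
awk 'BEGIN {
  base = 2.19
  printf "RS 6-3 64KB: %.1fx\n", 6.33 / base
  printf "RS 6-3 1MB:  %.1fx\n", 6.22 / base
  printf "RS 10-4 1MB: %.1fx\n", 7.09 / base
}'
```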
+Our read tests are not as dramatic as those in [EC Myths], but still look
+good for EC.  Here we show the results for reading back the 1TB file created
+in the write test using 16 Spark executors.  In addition to
+the straight read tests, we also performed tests with 2 DataNodes disabled,
+to simulate the performance hit of failures which require data repair in the
+foreground.  Finally, we tested the read performance
+after a background rebuild of the filesystem.  We did this to see whether the
+foreground rebuild or the loss of 2 DataNodes was the major contributor to
+any performance degradation.  As can be seen,
+EC read performance is close to 2X faster than replication, even in the face
+of failures.
+|Encoding|32 nodes<br>no failures|30 nodes<br>with failures|30 nodes<br>no missing data|
+|--------|----------------------:|------------------------:|--------------------------:|
+|Replication|3.95 GB/s|3.99 GB/s|3.89 GB/s|
+|RS 6-3 64KB|7.36 GB/s|7.27 GB/s|7.16 GB/s|
+|RS 6-3 1MB|6.59 GB/s|6.47 GB/s|6.53 GB/s|
+|RS 10-4 1MB|6.21 GB/s|6.08 GB/s|6.21 GB/s|
+### Accumulo Performance with EC
+While the above results are impressive, they are not representative of how
+Accumulo uses HDFS.  For starters, Accumulo sequential I/O does far more than
+just reading or writing files; compression and serialization,
+for example, place quite a load upon the tablet server CPUs.  An example to
+illustrate this is shown below.
+The time in minutes to bulk-write 400 million rows to RFiles with 40 Spark
+executors is listed for both EC using RS 6-3 with 1MB stripes and triple
+replication.  The choice of compressor has a much more profound
+effect on the write times than the choice of underlying encoding for the
+directory being written to
+(although without compression EC is much faster than replication).
+|Compressor | RS 6-3 1MB | Replication | File size (GB) |
+|---------- | ---------: | ----------: | -------------: |
+|gz | 2.7 | 2.7 | 21.3 |
+|none | 2.0 | 3.0 | 158.5 |
+|snappy | 1.6 | 1.6 | 38.4 |
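+Should you want to experiment with compressors yourself, the compressor is a per-table Accumulo property (the table name below is illustrative):

```shell
# Switch mytable's RFile compressor to snappy, then rewrite its files
accumulo shell -u root -e 'config -t mytable -s table.file.compress.type=snappy'
accumulo shell -u root -e 'compact -t mytable -w'
```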
+Of much more importance to Accumulo performance is read latency. A frequent
+use case for our group is to obtain a number of row IDs from an index and then
+use a BatchScanner to read those individual rows.
+In this use case, the time to access a single row is far more important than
+the raw I/O performance.  To test Accumulo's performance with EC for this use
+case, we ran a series of tests against a 10 billion row table,
+with each row consisting of 10 columns.  16 Spark executors each performed
+10000 queries, where each query sought 10 random rows.  Thus 1.6 million
+individual rows were returned in batches of 10.  For each batch of
+10, the time in milliseconds was captured, and these times were collected in
+a histogram of 50ms buckets, with a catch-all bucket for queries that took
+over 1 second.  For this test we reconfigured our cluster to make use
+of c5n.4xlarge nodes featuring much faster networking (15 Gbps sustained vs
+5 Gbps for c5.4xlarge). Because these nodes are in short supply, we ran with
+only 16 HDFS nodes (c5n.4xlarge),
+but still had 16 Spark nodes (also c5n.4xlarge).  Zookeeper and master nodes
+remained the same.
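+The bucketing described above can be sketched in a few lines (the latency values piped in here are made up):

```shell
# Collect per-batch latencies (ms) into 50 ms histogram buckets, with a
# catch-all for anything over 1 second, as in the test methodology above.
# Output order is arbitrary (awk hash iteration).
printf '%s\n' 23 48 51 103 990 1500 | awk '
  { key = ($1 > 1000) ? ">1000" : int($1 / 50) * 50; count[key]++ }
  END { for (key in count) print key, count[key] }'
```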
+In the table below, we show the min, average, and max times in milliseconds
+for each batch of 10 across four different encoding policies.  The clear
+winner here is replication, and the clear loser RS 10-4 with
+1MB stripes, but RS 6-3 with 64KB stripes is not looking too bad.
+|Encoding|Min (ms)|Avg (ms)|Max (ms)|
+|--------|-------:|-------:|-------:|
+|RS 10-4 1MB|40|105|2148|
+|RS 6-3 1MB|30|68|1297|
+|RS 6-3 64KB|23|43|1064|
+The above results also hold in the event of errors.  The next table shows the
+same test, but with 2 DataNodes disabled to simulate failures that require
+foreground rebuilds.  Again, replication wins and RS 10-4 1MB
+loses, but RS 6-3 64KB remains a viable option.
+|Encoding|Min (ms)|Avg (ms)|Max (ms)|
+|--------|-------:|-------:|-------:|
+|RS 10-4 1MB|53|143|3221|
+|RS 6-3 1MB|34|113|1662|
+|RS 6-3 64KB|24|61|1402|
+The images below show plots of the histograms.  The third plot was generated
+with 14 HDFS DataNodes, but after all missing data had been repaired.  Again,
+this was done to see how much of the performance degradation could be
+attributed to missing data, and how much to simply having less computing
+power.
+<img src='/images/blog/201909_ec/ec-latency-16.png' width="75%"><br><br>
+<img src='/images/blog/201909_ec/ec-latency-14e.png' width="75%"><br><br>
+<img src='/images/blog/201909_ec/ec-latency-14.png' width="75%">
+### Conclusion
+HDFS with erasure coding has the potential to double your available Accumulo
+storage, at the cost of a hit in random seek times, but a potential increase
+in sequential scan performance. We will be proposing some changes
+to Accumulo to make working with EC a bit easier. Our initial thoughts are
+collected in this Accumulo dev list
+[EC Myths]:
+[EC Introduction]:
diff --git a/images/blog/201909_ec/ec-latency-14.png 
new file mode 100644
index 0000000..a4db326
Binary files /dev/null and b/images/blog/201909_ec/ec-latency-14.png differ
diff --git a/images/blog/201909_ec/ec-latency-14e.png 
new file mode 100644
index 0000000..ade951f
Binary files /dev/null and b/images/blog/201909_ec/ec-latency-14e.png differ
diff --git a/images/blog/201909_ec/ec-latency-16.png 
new file mode 100644
index 0000000..fef02d2
Binary files /dev/null and b/images/blog/201909_ec/ec-latency-16.png differ
