This is an automated email from the ASF dual-hosted git repository. kturner pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/accumulo-website.git
The following commit(s) were added to refs/heads/master by this push: new 54c5b50 add a blog post about using HDFS erasure coding with Accumulo (#194) 54c5b50 is described below commit 54c5b5058860cc366711c9e521653c0fa2cb1837 Author: etseidl <etse...@users.noreply.github.com> AuthorDate: Wed Sep 18 14:41:49 2019 -0700 add a blog post about using HDFS erasure coding with Accumulo (#194)
---
 _posts/blog/2019-09-17-erasure-coding.md | 181 +++++++++++++++++++++++++++++++
 images/blog/201909_ec/ec-latency-14.png | Bin 0 -> 751396 bytes
 images/blog/201909_ec/ec-latency-14e.png | Bin 0 -> 843379 bytes
 images/blog/201909_ec/ec-latency-16.png | Bin 0 -> 854230 bytes
 4 files changed, 181 insertions(+)
diff --git a/_posts/blog/2019-09-17-erasure-coding.md b/_posts/blog/2019-09-17-erasure-coding.md
new file mode 100644
index 0000000..39083be
--- /dev/null
+++ b/_posts/blog/2019-09-17-erasure-coding.md
@@ -0,0 +1,181 @@
+---
+title: "Using HDFS Erasure Coding with Accumulo"
+author: Ed Seidl
+reviewers:
+---
+
+HDFS normally stores multiple copies of each file for both performance and durability reasons.
+The number of copies is controlled via HDFS replication settings and is set to 3 by default. Hadoop 3
+introduced erasure coding (EC), which improves durability while decreasing storage overhead.
+Since Accumulo 2.0 now supports Hadoop 3, it's time to take a look at whether using
+EC with Accumulo makes sense.
+
+* [EC Intro](#ec-intro)
+* [EC Performance](#ec-performance)
+* [Accumulo Performance with EC](#accumulo-performance-with-ec)
+
+### EC Intro
+
+By default, HDFS achieves durability via block replication. Usually
+the replication count is 3, resulting in a storage overhead of 200%. Hadoop 3
+introduced EC as a more efficient way to achieve durability. More info can be
+found [here](https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html).
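To make the overhead figures concrete, here is a small Python sketch of the arithmetic behind the 200% and 50% numbers. This is our own illustration, not code from the post; the function names are ours.

```python
# Illustration only: storage overhead of r-way replication
# vs. a Reed-Solomon k-m erasure coding policy.

def replication_overhead(r: int) -> float:
    """Extra storage beyond one copy of the data; r=3 gives 2.0, i.e. 200%."""
    return float(r - 1)

def ec_overhead(k: int, m: int) -> float:
    """Parity overhead for RS k-m; RS 6-3 gives 0.5, i.e. 50%."""
    return m / k

print(f"3x replication: {replication_overhead(3):.0%} overhead")  # 200%
print(f"RS 6-3:         {ec_overhead(6, 3):.0%} overhead")        # 50%
print(f"RS 10-4:        {ec_overhead(10, 4):.0%} overhead")       # 40%
```

Note that the overhead shrinks as *k* grows relative to *m*, which is why RS 10-4 is even cheaper than RS 6-3 while tolerating one more erasure.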
EC behaves much like RAID 5 or 6: for *k* blocks of data, *m* blocks of
+parity data are generated, from which the original data can be recovered in the
+event of disk or node failures (erasures, in EC parlance). A typical EC scheme is Reed-Solomon 6-3, where
+6 data blocks produce 3 parity blocks, an overhead of only 50%. In addition
+to doubling the available disk space, RS 6-3 is also more fault
+tolerant: a loss of 3 blocks can be tolerated, whereas triple replication
+can only lose two copies of a block.
+
+More storage, better resiliency, so what's the catch? One concern is
+the time spent calculating the parity blocks. Unlike replication,
+where a client writes a block and the DataNodes then replicate
+the data, an EC HDFS client is responsible for computing the parity and sending it
+to the DataNodes. This increases the CPU and network load on the client. The CPU
+hit can be mitigated by using Intel's ISA-L library, but only on CPUs
+that support AVX or AVX2 instructions. (See [EC Myths] and [EC Introduction]
+for some interesting claims.) In addition, unlike the serial replication I/O path,
+the EC I/O path is parallel, providing greater throughput. In our testing, sequential writes to
+an EC directory were as much as 3 times faster than to a replicated directory,
+and reads were up to 2 times faster.
+
+Another side effect of EC is the loss of data locality. For performance reasons, EC
+data blocks are striped, so multiple DataNodes must be contacted to read a single
+block of data. For large sequential reads this is not a
+problem, but it can be an issue for small random lookups. For the latter case,
+using RS 6-3 with 64KB stripes mitigates some of the random-lookup pain
+without compromising sequential read/write performance.
+
+#### Important Warning
+
+Before continuing, an important caveat: the current implementation of EC in Hadoop supports neither hsync
+nor hflush. Both of these operations are silent no-ops (EC [limitations]).
We discovered this the hard
+way when a data center power loss corrupted our write-ahead logs, which were
+stored in an EC directory. To avoid this problem, ensure all
+WAL directories use replication. It's probably a good idea to keep the
+accumulo namespace replicated as well, but we have no evidence to back up that assertion. As with all
+things, don't test on production data.
+
+### EC Performance
+
+To test EC performance, we created a series of clusters on AWS. Our Accumulo stack consisted of
+Hadoop 3.1.1 built with the Intel ISA-L library enabled, ZooKeeper 3.4.13, and Accumulo 1.9.3 configured
+to work with Hadoop 3 (we did our testing before the official release of Accumulo 2.0). The encoding
+policy is set per-directory using the [hdfs] command-line tool. To set the encoding policy
+for an Accumulo table, first find the table ID (for instance, using the Accumulo shell's
+`tables -l` command), and then from the command line set the policy for the corresponding directory
+under /accumulo/tables. Note that changing the policy on a directory will set the policy for
+child directories, but will not change any files contained within. To change the policy on an existing
+Accumulo table, you must therefore first set the encoding policy and then run a major compaction to rewrite
+the RFiles for the table.
+
+Our first tests were of sequential read and write performance straight to HDFS. For this test we had
+a cluster of 32 HDFS nodes (c5.4xlarge [AWS] instances), 16 Spark nodes (r5.4xlarge),
+3 ZooKeeper nodes (r5.xlarge), and 1 master (r5.2xlarge).
+
+The first table below shows the results for writing a 1TB file. The results are the average of three runs
+for each of the directory encodings Reed-Solomon (RS) 6-3 with 64KB stripes, RS 6-3 with 1MB stripes,
+RS 10-4 with 1MB stripes, and the default triple replication.
We also varied the number of concurrent
+Spark executors, performing tests with 16 executors that did not stress the cluster in any area, and with
+128 executors, which exhausted our network bandwidth allotment of 5 Gbps. As can be seen, in the 16-executor
+environment we saw greater than a 3X bump in throughput using RS 10-4 with 1MB stripes over triple replication.
+At saturation, the speedup was still over 2X, which is in line with the results from [EC Myths]. Also of note,
+RS 6-3 with 64KB stripes performed better than the same encoding with 1MB stripes, which is a nice result for Accumulo,
+as we'll show later.
+
+|Encoding|16 executors|128 executors|
+|--------|-----------:|------------:|
+|Replication|2.19 GB/s|4.13 GB/s|
+|RS 6-3 64KB|6.33 GB/s|8.11 GB/s|
+|RS 6-3 1MB|6.22 GB/s|7.93 GB/s|
+|RS 10-4 1MB|7.09 GB/s|8.34 GB/s|
+
+Our read results are not as dramatic as those in [EC Myths], but still look good for EC. Here we show the
+results for reading back the 1TB file created in the write test using 16 Spark executors. In addition to
+the straight read tests, we also performed tests with 2 DataNodes disabled to simulate the performance hit
+of failures that require data repair in the foreground. Finally, we tested the read performance
+after a background rebuild of the filesystem. We did this to see whether the foreground rebuild or
+the loss of 2 DataNodes was the major contributor to any performance degradation. As can be seen,
+EC read performance is close to 2X faster than replication, even in the face of failures.
+
+|Encoding|32 nodes<br>no failures|30 nodes<br>with failures|30 nodes<br>no failures|
+|--------|----------------------:|------------------------:|----------------------:|
+|Replication|3.95 GB/s|3.99 GB/s|3.89 GB/s|
+|RS 6-3 64KB|7.36 GB/s|7.27 GB/s|7.16 GB/s|
+|RS 6-3 1MB|6.59 GB/s|6.47 GB/s|6.53 GB/s|
+|RS 10-4 1MB|6.21 GB/s|6.08 GB/s|6.21 GB/s|
+
+### Accumulo Performance with EC
+
+While the above results are impressive, they are not representative of how Accumulo uses HDFS. For starters,
+Accumulo sequential I/O does far more than just read or write files; compression and serialization,
+for example, place quite a load on the tablet server CPUs. An example to illustrate this is shown below.
+The time in minutes to bulk-write 400 million rows to RFiles with 40 Spark executors is listed for both EC
+using RS 6-3 with 1MB stripes and triple replication. The choice of compressor has a much more profound
+effect on the write times than the choice of underlying encoding for the directory being written to
+(although without compression EC is much faster than replication).
+
+|Compressor | RS 6-3 1MB | Replication | File size (GB) |
+|---------- | ---------: | ----------: | -------------: |
+|gz | 2.7 | 2.7 | 21.3 |
+|none | 2.0 | 3.0 | 158.5 |
+|snappy | 1.6 | 1.6 | 38.4 |
+
+Of much more importance to Accumulo performance is read latency. A frequent use case for our group is to obtain a
+number of row IDs from an index and then use a BatchScanner to read those individual rows.
+In this use case, the time to access a single row is far more important than raw I/O performance. To test
+Accumulo's performance with EC for this use case, we ran a series of tests against a 10 billion row table,
+with each row consisting of 10 columns. 16 Spark executors each performed 10000 queries, where each query
+sought 10 random rows. Thus 1.6 million individual rows were returned in batches of 10.
For each batch of
+10, the time in milliseconds was captured, and these times were collected in a histogram of 50ms buckets, with
+a catch-all bucket for queries that took over 1 second. For this test we reconfigured our cluster to make use
+of c5n.4xlarge nodes featuring much faster networking speeds (15 Gbps sustained vs 5 Gbps for
+c5.4xlarge). Because these nodes are in short supply, we ran with only 16 HDFS nodes (c5n.4xlarge),
+but still had 16 Spark nodes (also c5n.4xlarge). ZooKeeper and master nodes remained the same.
+
+In the table below, we show the min, max, and average times in milliseconds for each batch of 10 across
+four different encoding policies. The clear winner here is replication, and the clear loser is RS 10-4 with
+1MB stripes, but RS 6-3 with 64KB stripes does not look too bad.
+
+|Encoding|Min|Avg|Max|
+|--------|--:|--:|--:|
+|RS 10-4 1MB|40|105|2148|
+|RS 6-3 1MB|30|68|1297|
+|RS 6-3 64KB|23|43|1064|
+|Replication|11|23|731|
+
+The above results also hold in the event of errors. The next table shows the same test, but with 2 DataNodes
+disabled to simulate failures that require foreground rebuilds. Again, replication wins and RS 10-4 1MB
+loses, but RS 6-3 64KB remains a viable option.
+
+|Encoding|Min|Avg|Max|
+|--------|--:|--:|--:|
+|RS 10-4 1MB|53|143|3221|
+|RS 6-3 1MB|34|113|1662|
+|RS 6-3 64KB|24|61|1402|
+|Replication|12|26|304|
+
+The images below show plots of the histograms. The third plot was generated with 14 HDFS DataNodes, but after
+all missing data had been repaired. Again, this was done to see how much of the performance degradation could be
+attributed to missing data, and how much to simply having less computing power available.
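The bucketing scheme described above (50ms buckets with a catch-all for anything over 1 second) can be sketched in a few lines of Python. This is a hypothetical reconstruction for illustration; the actual test harness is not shown in the post.

```python
from collections import Counter

BUCKET_MS = 50       # histogram bucket width used in the tests above
CATCH_ALL_MS = 1000  # batches over 1 second land in a single catch-all bucket

def bucket(latency_ms: float) -> int:
    """Return the lower edge of the 50ms bucket for a latency, or the catch-all."""
    if latency_ms >= CATCH_ALL_MS:
        return CATCH_ALL_MS
    return int(latency_ms // BUCKET_MS) * BUCKET_MS

def histogram(latencies_ms):
    """Collect per-batch latencies (ms) into 50ms buckets."""
    return Counter(bucket(t) for t in latencies_ms)

# Example: three fast batches and one slow outlier land in
# buckets 0, 0, 50, and the 1000ms catch-all.
print(histogram([11, 23, 68, 2148]))
```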
+<center>
+<img src='/images/blog/201909_ec/ec-latency-16.png' width="75%"><br><br>
+
+<img src='/images/blog/201909_ec/ec-latency-14e.png' width="75%"><br><br>
+
+<img src='/images/blog/201909_ec/ec-latency-14.png' width="75%">
+</center>
+
+### Conclusion
+HDFS with erasure coding has the potential to double your available Accumulo storage at the cost of slower
+random seeks, while potentially increasing sequential scan performance. We will be proposing some changes
+to Accumulo to make working with EC a bit easier. Our initial thoughts are collected in this
+Accumulo dev list [post](https://lists.apache.org/thread.html/4ac5b0f664e15fa120e748892612f1e417b7dee3e1539669d179900c@%3Cdev.accumulo.apache.org%3E).
+
+[EC Myths]: https://www.slideshare.net/HadoopSummit/debunking-the-myths-of-hdfs-erasure-coding-performance
+[EC Introduction]: https://blog.cloudera.com/introduction-to-hdfs-erasure-coding-in-apache-hadoop/
+[AWS]: https://aws.amazon.com/ec2/instance-types/
+[limitations]: https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html#Limitations
+[hdfs]: https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html#Administrative_commands
diff --git a/images/blog/201909_ec/ec-latency-14.png b/images/blog/201909_ec/ec-latency-14.png
new file mode 100644
index 0000000..a4db326
Binary files /dev/null and b/images/blog/201909_ec/ec-latency-14.png differ
diff --git a/images/blog/201909_ec/ec-latency-14e.png b/images/blog/201909_ec/ec-latency-14e.png
new file mode 100644
index 0000000..ade951f
Binary files /dev/null and b/images/blog/201909_ec/ec-latency-14e.png differ
diff --git a/images/blog/201909_ec/ec-latency-16.png b/images/blog/201909_ec/ec-latency-16.png
new file mode 100644
index 0000000..fef02d2
Binary files /dev/null and b/images/blog/201909_ec/ec-latency-16.png differ