Hi All.

Following up on a conversation with Christopher on the Slack channel, what 
follows is a modest proposal to make hosting Accumulo tables on erasure-coded 
HDFS directories easier.  This post turned out to be pretty long…if you already 
know what erasure coding in HDFS is about, skip down a page to the paragraph 
that starts with “Sorry”.

First things first, a brief intro to erasure coding (EC).  As we all know, by 
default HDFS file systems achieve durability via block replication.  Usually 
the replication level is set to 3, so the resulting disk overhead for 
reliability is 200%.  (Yes, there are also performance benefits to replication 
when disk locality can be exploited, but I'm going to ignore those for now.)  
Hadoop 3.0 introduced EC as a better way to achieve durability.  More info can be 
found here 
(https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html).
  EC behaves much like RAID...for M blocks of data, N blocks of parity are 
generated, from which the original data can be recovered in the event of block 
failures.  A typical EC scheme is Reed-Solomon 6-3 (RS-6-3), where 6 blocks of 
data produce 3 blocks of parity, an overhead of only 50%.  In addition to the 
factor-of-2 increase in usable disk space, RS-6-3 is also more fault 
tolerant...the loss of any 3 blocks in a stripe can be tolerated, whereas triple 
replication can only lose two copies of a block.
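
To make that arithmetic concrete, here is a tiny standalone sketch (plain Java, 
no Hadoop dependencies) that works out the raw-disk footprint of both schemes 
for the same amount of user data:

  // Raw disk consumed by the same logical data under 3x replication vs. RS-6-3.
  public class StorageOverhead {
    public static void main(String[] args) {
      double logicalGB = 600.0;                  // example: 600 GB of user data
      double replicated = logicalGB * 3;         // 3 copies -> 200% overhead
      double rs63 = logicalGB * (6 + 3) / 6.0;   // 6 data + 3 parity -> 50% overhead
      System.out.printf("3x replication: %.0f GB raw, RS-6-3: %.0f GB raw (%.1fx savings)%n",
          replicated, rs63, replicated / rs63);
    }
  }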

More storage, better resiliency, so what's the catch?  One concern with using 
EC is the time spent calculating the parity blocks.  Unlike the default 
replication write path, where a client writes a block, and then the datanodes 
take care of replicating the data, an EC HDFS client is responsible for 
computing the parity and sending that to the datanodes.  This increases the CPU 
and network load on the client.  The CPU
hit can be mitigated through the use of the Intel ISA-L library, but only on 
CPUs that support AVX.  (See 
https://www.slideshare.net/HadoopSummit/debunking-the-myths-of-hdfs-erasure-coding-performance
 and 
https://blog.cloudera.com/introduction-to-hdfs-erasure-coding-in-apache-hadoop/ 
for some interesting claims).  In our testing, we've found that sequential 
writes to an EC-encoded directory can be as much as 3 times faster than to a 
directory with replication, and reads are up to 2 times faster.

Another side effect of EC is the loss of data locality.  For performance 
reasons, EC data blocks are striped, so multiple datanodes must be contacted to 
read a single block of data.  For large sequential reads this doesn't appear to 
be much of a problem, but it can be an issue for small random lookups.  For the 
latter case, we've found that using a 64K cell size (rather than the default 1M) 
can mitigate some of the random lookup pain without compromising sequential 
read/write performance.
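
For anyone who wants to experiment, the same thing the "hdfs ec" CLI does can be 
driven from the Java API.  A minimal sketch, assuming the default filesystem is 
HDFS and that a 64K-cell policy named "RS-6-3-64k" has already been added and 
enabled on the cluster (via "hdfs ec -addPolicies" / "-enablePolicy"; the 
built-in RS-6-3 policy uses 1024k cells):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class SetEcPolicy {
    public static void main(String[] args) throws Exception {
      try (FileSystem fs = FileSystem.get(new Configuration())) {
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        Path dir = new Path("/user/ed/ec-test");   // placeholder directory
        // "RS-6-3-64k" is a user-defined policy in this sketch; the built-in
        // policy with 1M cells is named "RS-6-3-1024k".
        dfs.setErasureCodingPolicy(dir, "RS-6-3-64k");
        System.out.println(dfs.getErasureCodingPolicy(dir));
      }
    }
  }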

In terms of Accumulo performance, we don't see a dramatic difference in scan 
performance between EC and replication.  This is because our tables are 
gzip-compressed, so the time spent decompressing, deserializing, and doing other 
key-management work far outweighs the actual disk I/O time.  The same seems to 
be the case for batch writing of RFiles.  Here 
are the results for a test I did using 40 Spark executors, each writing 10M 
rows to RFiles in RS-6-3 and replicated directories:

          write time (min)
          RS-6-3   replicated   size (GB)
gz          2.7        2.7         21.3
none        2.0        3.0        158.5
snappy      1.6        1.6         38.4

The only time EC makes a difference here is when compression is turned 
off...then the write speed is 50% faster with EC.  (It's interesting to note one 
nice trade-off that EC makes possible: switching to the faster Snappy 
compression uses roughly 2x the disk space of gzip, but if EC is used, you get 
that 2x back in raw disk.)
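
For reference, flipping a table between gzip and Snappy is just a table 
property, so this trade-off is easy to play with.  A sketch using the 2.0 client 
API (the client.properties path and table name are placeholders):

  import org.apache.accumulo.core.client.Accumulo;
  import org.apache.accumulo.core.client.AccumuloClient;

  public class SetCompression {
    public static void main(String[] args) throws Exception {
      try (AccumuloClient client =
          Accumulo.newClient().from("client.properties").build()) {
        // switch the table's RFile compression codec to snappy
        client.tableOperations().setProperty("mytable",
            "table.file.compress.type", "snappy");
      }
    }
  }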

As noted above, random access times can be a problem with EC.  This impacts 
Accumulo's ability to randomly fetch single rows of data.  In a test of 16 
Spark executors doing random seeks in batches of 10 into a table with 10B rows, 
we found the following latencies per row:

            latency per row (ms)
              min  max  avg
RS-10-4-1M      4  214   11
RS-6-3-1M       3  130    7
RS-6-3-64K      2  106    4
replication     1   73    2
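
For anyone curious what that access pattern looks like, one batch of lookups was 
roughly the following sketch (not the actual test code; the table name, 
client.properties, and row format are placeholders):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map.Entry;
  import org.apache.accumulo.core.client.Accumulo;
  import org.apache.accumulo.core.client.AccumuloClient;
  import org.apache.accumulo.core.client.BatchScanner;
  import org.apache.accumulo.core.data.Key;
  import org.apache.accumulo.core.data.Range;
  import org.apache.accumulo.core.data.Value;
  import org.apache.accumulo.core.security.Authorizations;

  public class RandomLookups {
    public static void main(String[] args) throws Exception {
      try (AccumuloClient client =
              Accumulo.newClient().from("client.properties").build();
          BatchScanner scanner =
              client.createBatchScanner("testtable", Authorizations.EMPTY, 10)) {
        // one batch of 10 random row lookups against a table with 10B rows
        List<Range> batch = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
          long row = (long) (Math.random() * 10_000_000_000L);
          batch.add(Range.exact(String.format("row%011d", row)));
        }
        scanner.setRanges(batch);
        int found = 0;
        for (Entry<Key,Value> entry : scanner) {
          found++;   // the time spent in this loop is what the latencies above measure
        }
        System.out.println("found " + found + " of " + batch.size() + " rows");
      }
    }
  }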

One big gotcha that was not immediately evident is this:  the current HDFS EC 
implementation does not support hflush() or hsync().  These operations are 
no-ops. I discovered this the hard way when we had an unexpected power outage 
in our data center.  I had been using EC for the write-ahead log directories, 
and Accumulo was in the midst of updating the metadata table when the power 
went out.  Because hflush() returned without actually forcing anything to disk, 
we lost some writes to the metadata table, which left all of our tables 
unreadable.  
Thankfully it was a dev system so the loss wasn't a big deal (plus the data was 
still there, and I did recover what I needed by re-importing the RFiles).  The 
good news is that EC is not an all-or-nothing thing, but is instead implemented 
on a per-directory basis,
with children inheriting their parent's policy at creation time.  So moving 
forward, we keep our WAL and the accumulo namespace in replicated directories, 
but use EC for the rest of our table data.
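
If you want to double-check which directories are covered, 
getErasureCodingPolicy() returns null for a path that is still using plain 
replication, so a quick sanity check on the WAL volumes looks roughly like this 
(the path is a placeholder for wherever your instance.volumes puts WALs):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class CheckWalPolicy {
    public static void main(String[] args) throws Exception {
      try (FileSystem fs = FileSystem.get(new Configuration())) {
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        Path wal = new Path("/accumulo/wal");   // placeholder WAL directory
        // null means plain replication, i.e. hflush()/hsync() really persist data
        System.out.println(wal + " EC policy: " + dfs.getErasureCodingPolicy(wal));
      }
    }
  }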

Sorry if that was a bit of a TL;DR...now, on to my proposal.  As it stands, EC is 
controlled at the HDFS directory level, so the only way to turn it on for 
Accumulo is to use the "hdfs ec" command to set the encoding on individual 
directories under /accumulo/tables. So, to create a table with EC, the steps 
are 1) create table, 2) look up table ID, 3) use command line to set policy for 
/accumulo/tables/<ID>.  What I propose is to add a per-table/namespace property 
“table.encoding.policy”.  Setting the policy for a namespace would ensure all 
tables subsequently created in that namespace would inherit that policy, but 
this could then be overridden on individual tables if need be.  One caveat here 
is that changing the policy on an existing directory does not change the policy 
on the files within it...instead the data needs to be copied so that it is 
rewritten with the appropriate policy.  Thus, changing the encoding policy for 
an existing table would require a major compaction to rewrite all of that 
table's RFiles.
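
To make the current workaround concrete, here is a sketch of those three steps 
done programmatically (the table name, policy name, and single-volume 
/accumulo/tables layout are assumptions); the proposed property would collapse 
this into one setProperty() call, shown in the trailing comment:

  import org.apache.accumulo.core.client.Accumulo;
  import org.apache.accumulo.core.client.AccumuloClient;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class CreateEcTable {
    public static void main(String[] args) throws Exception {
      try (AccumuloClient client =
              Accumulo.newClient().from("client.properties").build();
          FileSystem fs = FileSystem.get(new Configuration())) {
        // 1) create the table
        client.tableOperations().create("mytable");
        // 2) look up its table ID
        String id = client.tableOperations().tableIdMap().get("mytable");
        // 3) set the EC policy on the table's directory
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        dfs.setErasureCodingPolicy(new Path("/accumulo/tables/" + id), "RS-6-3-1024k");

        // With the proposed property, steps 2 and 3 would become something like:
        // client.tableOperations().setProperty("mytable",
        //     "table.encoding.policy", "RS-6-3-1024k");
      }
    }
  }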

And while we're adding an encoding policy, I thought it would also be good to be 
able to specify the HDFS storage policy (HOT, COLD, WARM, etc.), which is also 
set on a per-directory basis. So, a second property “table.storage.policy” is 
proposed (but I should note that my humble cluster won’t allow me to test this 
beyond setting the policy on the directories…I don’t have tiered storage to see 
if it actually makes a difference).
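
Setting a storage policy goes through a very similar per-directory API (or the 
"hdfs storagepolicies" command); a minimal sketch, with the path and policy as 
placeholders:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SetStoragePolicy {
    public static void main(String[] args) throws Exception {
      try (FileSystem fs = FileSystem.get(new Configuration())) {
        Path dir = new Path("/accumulo/tables/2b");  // placeholder table directory
        fs.setStoragePolicy(dir, "COLD");            // COLD = all replicas on ARCHIVE storage
        System.out.println(fs.getStoragePolicy(dir));
      }
    }
  }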

Both the encoding and storage policies can be enforced by adding another 
mkdirs() method to org.apache.accumulo.server.fs.VolumeManager that takes the 
path, storage policy, and encoding policy as arguments, as well as a helper 
function checkDirPolicies().  This would require changes to many of the 
operations under master/tableOps that call mkdirs(). Some changes to 
org.apache.accumulo.tserver.tablet.Tablet will also be needed to account for 
directory creation during splits and to detect property changes.
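
Roughly, the VolumeManager additions would have this shape (names and signatures 
are illustrative only, not the actual patch):

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;

  // Sketch of the proposed additions to org.apache.accumulo.server.fs.VolumeManager
  public interface VolumeManagerEcAdditions {
    // proposed: create the directory and apply the table's policies in one step
    boolean mkdirs(Path path, String storagePolicy, String encodingPolicy) throws IOException;

    // proposed helper: verify (and if needed apply) the policies on an existing
    // directory, e.g. after a property change or when a tablet directory is
    // created during a split
    void checkDirPolicies(Path path, String storagePolicy, String encodingPolicy)
        throws IOException;
  }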

I have this implemented in Accumulo 2.0, and can share patches if there is 
interest.  I'm a little worried that the approach I took is too HDFS-specific, 
so some thought would have to go into structuring the HDFS implementation so 
that it doesn't become a burden down the road should another filesystem 
implementation ever be added.

I welcome any thoughts, suggestions or not-too-barbed criticisms :)  Thanks for 
reading.

Ed
