Hi All.

Following up on a conversation with Christopher on the Slack channel, what 
follows is a modest proposal to make hosting Accumulo tables on erasure-coded 
HDFS directories easier.  This post turned out to be pretty long…if you already 
know what erasure coding in HDFS is about, skip down a page to the paragraph 
that starts with “Sorry”.

First things first, a brief intro to erasure coding (EC).  As we all know, by 
default HDFS file systems achieve durability via block replication.  Usually 
the replication level is set to 3, so the resulting disk overhead for 
reliability is 200%.  (Yes, there are also performance benefits to replication 
when disk locality can be exploited, but I'm going to ignore those for now.)  
Hadoop 3.0 introduced EC as a better way to achieve durability.  More info can be 
found here 
(https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html).
  EC behaves much like RAID...for M blocks of data, N blocks of parity are 
generated, from which the original data can be recovered in the event of block 
failures.  A typical EC scheme is Reed-Solomon 6-3 (RS-6-3), where 6 blocks of 
data produce 3 blocks of parity, an overhead of only 50%.  In addition to the 
factor-of-2 increase in usable disk space, RS-6-3 is also more fault 
tolerant...the loss of any 3 blocks in a stripe can be tolerated, whereas triple 
replication can only lose two copies of a block.
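
To make that arithmetic concrete, here is a tiny standalone sketch (plain Java, 
no Hadoop dependencies) that works out the raw-disk footprint of both schemes 
for the same amount of user data:

  // Raw disk consumed by the same logical data under 3x replication vs. RS-6-3.
  public class StorageOverhead {
    public static void main(String[] args) {
      double logicalGB = 600.0;                  // example: 600 GB of user data
      double replicated = logicalGB * 3;         // 3 copies -> 200% overhead
      double rs63 = logicalGB * (6 + 3) / 6.0;   // 6 data + 3 parity -> 50% overhead
      System.out.printf("3x replication: %.0f GB raw, RS-6-3: %.0f GB raw (%.1fx savings)%n",
          replicated, rs63, replicated / rs63);
    }
  }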

More storage, better resiliency, so what's the catch?  One concern with using 
EC is the time spent calculating the parity blocks.  Unlike the default 
replication write path, where a client writes a block, and then the datanodes 
take care of replicating the data, an EC HDFS client is responsible for 
computing the parity and sending that to the datanodes.  This increases the CPU 
and network load on the client.  The CPU
hit can be mitigated through the use of the Intel ISA-L library, but only on 
CPUs that support AVX.  (See 
https://www.slideshare.net/HadoopSummit/debunking-the-myths-of-hdfs-erasure-coding-performance
 and 
https://blog.cloudera.com/introduction-to-hdfs-erasure-coding-in-apache-hadoop/ 
for some interesting claims).  In our testing, we've found that sequential 
writes to an EC-encoded directory can be as much as 3 times faster than to a 
directory with replication, and reads are up to 2 times faster.

Another side effect of EC is the loss of data locality.  For performance 
reasons, EC data blocks are striped, so multiple datanodes must be contacted to 
read a single block of data.  For large sequential reads this doesn't appear to 
be much of a problem, but it can be an issue for small random lookups.  For the 
latter case, we've found that using a 64K cell size (rather than the default 1M) 
can mitigate some of the random lookup pain without compromising sequential 
read/write performance.
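
For anyone who wants to experiment, the same thing the "hdfs ec" CLI does can be 
driven from the Java API.  A minimal sketch, assuming the default filesystem is 
HDFS and that a 64K-cell policy named "RS-6-3-64k" has already been added and 
enabled on the cluster (via "hdfs ec -addPolicies" / "-enablePolicy"; the 
built-in RS-6-3 policy uses 1024k cells):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class SetEcPolicy {
    public static void main(String[] args) throws Exception {
      try (FileSystem fs = FileSystem.get(new Configuration())) {
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        Path dir = new Path("/user/ed/ec-test");   // placeholder directory
        // "RS-6-3-64k" is a user-defined policy in this sketch; the built-in
        // policy with 1M cells is named "RS-6-3-1024k".
        dfs.setErasureCodingPolicy(dir, "RS-6-3-64k");
        System.out.println(dfs.getErasureCodingPolicy(dir));
      }
    }
  }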

In terms of Accumulo performance, we don't see a dramatic difference in scan 
performance between EC and replication.  This is because our tables are 
gzip-compressed, so the time spent decompressing, deserializing, and doing other 
key-management work far outweighs the actual disk I/O time.  The same seems to 
be the case for batch writing of RFiles.  Here 
are the results for a test I did using 40 Spark executors, each writing 10M 
rows to RFiles in RS-6-3 and replicated directories:

          write time (min)
          RS-6-3   replicated   size (GB)
gz          2.7        2.7         21.3
none        2.0        3.0        158.5
snappy      1.6        1.6         38.4

The only time EC makes a difference here is when compression is turned 
off...then the write speed is 50% faster with EC.  (It's interesting to note one 
nice trade-off that EC makes possible: switching to the faster Snappy 
compression uses roughly 2x the disk space of gzip, but if EC is used, you get 
that 2x back in raw disk.)
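
For reference, flipping a table between gzip and Snappy is just a table 
property, so this trade-off is easy to play with.  A sketch using the 2.0 client 
API (the client.properties path and table name are placeholders):

  import org.apache.accumulo.core.client.Accumulo;
  import org.apache.accumulo.core.client.AccumuloClient;

  public class SetCompression {
    public static void main(String[] args) throws Exception {
      try (AccumuloClient client =
          Accumulo.newClient().from("client.properties").build()) {
        // switch the table's RFile compression codec to snappy
        client.tableOperations().setProperty("mytable",
            "table.file.compress.type", "snappy");
      }
    }
  }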

As noted above, random access times can be a problem with EC.  This impacts 
Accumulo's ability to randomly fetch single rows of data.  In a test of 16 
Spark executors doing random seeks in batches of 10 into a table with 10B rows, 
we found the following latencies per row:

            latency per row (ms)
              min  max  avg
RS-10-4-1M      4  214   11
RS-6-3-1M       3  130    7
RS-6-3-64K      2  106    4
replication     1   73    2
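
For anyone curious what that access pattern looks like, one batch of lookups was 
roughly the following sketch (not the actual test code; the table name, 
client.properties, and row format are placeholders):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map.Entry;
  import org.apache.accumulo.core.client.Accumulo;
  import org.apache.accumulo.core.client.AccumuloClient;
  import org.apache.accumulo.core.client.BatchScanner;
  import org.apache.accumulo.core.data.Key;
  import org.apache.accumulo.core.data.Range;
  import org.apache.accumulo.core.data.Value;
  import org.apache.accumulo.core.security.Authorizations;

  public class RandomLookups {
    public static void main(String[] args) throws Exception {
      try (AccumuloClient client =
              Accumulo.newClient().from("client.properties").build();
          BatchScanner scanner =
              client.createBatchScanner("testtable", Authorizations.EMPTY, 10)) {
        // one batch of 10 random row lookups against a table with 10B rows
        List<Range> batch = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
          long row = (long) (Math.random() * 10_000_000_000L);
          batch.add(Range.exact(String.format("row%011d", row)));
        }
        scanner.setRanges(batch);
        int found = 0;
        for (Entry<Key,Value> entry : scanner) {
          found++;   // the time spent in this loop is what the latencies above measure
        }
        System.out.println("found " + found + " of " + batch.size() + " rows");
      }
    }
  }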

One big gotcha that was not immediately evident is this:  the current HDFS EC 
implementation does not support hflush() or hsync().  These operations are 
no-ops. I discovered this the hard way when we had an unexpected power outage 
in our data center.  I had been using EC for the write-ahead log directories, 
and Accumulo was in the midst of updating the metadata table when the power 
went out.  Because hflush() returned without actually forcing anything to disk, 
we lost some writes to the metadata table, which left all of our tables 
unreadable.  
Thankfully it was a dev system so the loss wasn't a big deal (plus the data was 
still there, and I did recover what I needed by re-importing the RFiles).  The 
good news is that EC is not an all-or-nothing thing, but is instead implemented 
on a per-directory basis,
with children inheriting their parent's policy at creation time.  So moving 
forward, we keep our WAL and the accumulo namespace in replicated directories, 
but use EC for the rest of our table data.
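
If you want to double-check which directories are covered, 
getErasureCodingPolicy() returns null for a path that is still using plain 
replication, so a quick sanity check on the WAL volumes looks roughly like this 
(the path is a placeholder for wherever your instance.volumes puts WALs):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class CheckWalPolicy {
    public static void main(String[] args) throws Exception {
      try (FileSystem fs = FileSystem.get(new Configuration())) {
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        Path wal = new Path("/accumulo/wal");   // placeholder WAL directory
        // null means plain replication, i.e. hflush()/hsync() really persist data
        System.out.println(wal + " EC policy: " + dfs.getErasureCodingPolicy(wal));
      }
    }
  }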

Sorry if that was a bit of a TL;DR...now, on to my proposal.  As it stands, EC is 
controlled at the HDFS directory level, so the only way to turn it on for 
Accumulo is to use the "hdfs ec" command to set the encoding on individual 
directories under /accumulo/tables. So, to create a table with EC, the steps 
are 1) create table, 2) look up table ID, 3) use command line to set policy for 
/accumulo/tables/<ID>.  What I propose is to add a per-table/namespace property 
“table.encoding.policy”.  Setting the policy for a namespace would ensure all 
tables subsequently created in that namespace would inherit that policy, but 
this could then be overridden on individual tables if need be.  One caveat here 
is that changing the policy on an existing directory does not change the policy 
on the files within it...instead the data needs to be copied so that it is 
rewritten with the appropriate policy.  Thus, changing the encoding policy for 
an existing table would require a major compaction to rewrite all of that 
table's RFiles.
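
To make the current workaround concrete, here is a sketch of those three steps 
done programmatically (the table name, policy name, and single-volume 
/accumulo/tables layout are assumptions); the proposed property would collapse 
this into one setProperty() call, shown in the trailing comment:

  import org.apache.accumulo.core.client.Accumulo;
  import org.apache.accumulo.core.client.AccumuloClient;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class CreateEcTable {
    public static void main(String[] args) throws Exception {
      try (AccumuloClient client =
              Accumulo.newClient().from("client.properties").build();
          FileSystem fs = FileSystem.get(new Configuration())) {
        // 1) create the table
        client.tableOperations().create("mytable");
        // 2) look up its table ID
        String id = client.tableOperations().tableIdMap().get("mytable");
        // 3) set the EC policy on the table's directory
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        dfs.setErasureCodingPolicy(new Path("/accumulo/tables/" + id), "RS-6-3-1024k");

        // With the proposed property, steps 2 and 3 would become something like:
        // client.tableOperations().setProperty("mytable",
        //     "table.encoding.policy", "RS-6-3-1024k");
      }
    }
  }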

And while we're adding an encoding policy, I thought it would also be good to be 
able to specify the HDFS storage policy (HOT, COLD, WARM, etc.), which is also 
set on a per-directory basis. So, a second property “table.storage.policy” is 
proposed (but I should note that my humble cluster won’t allow me to test this 
beyond setting the policy on the directories…I don’t have tiered storage to see 
if it actually makes a difference).
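
Setting a storage policy goes through a very similar per-directory API (or the 
"hdfs storagepolicies" command); a minimal sketch, with the path and policy as 
placeholders:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SetStoragePolicy {
    public static void main(String[] args) throws Exception {
      try (FileSystem fs = FileSystem.get(new Configuration())) {
        Path dir = new Path("/accumulo/tables/2b");  // placeholder table directory
        fs.setStoragePolicy(dir, "COLD");            // COLD = all replicas on ARCHIVE storage
        System.out.println(fs.getStoragePolicy(dir));
      }
    }
  }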

Both the encoding and storage policies can be enforced by adding another 
mkdirs() method to org.apache.accumulo.server.fs.VolumeManager that takes the 
path, storage policy, and encoding policy as arguments, as well as a helper 
function checkDirPolicies().  This would require changes to many of the 
operations under master/tableOps that call mkdirs(). Some changes to 
org.apache.accumulo.tserver.tablet.Tablet will also be needed to account for 
directory creation during splits and to detect property changes.
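
Roughly, the VolumeManager additions would have this shape (names and signatures 
are illustrative only, not the actual patch):

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;

  // Sketch of the proposed additions to org.apache.accumulo.server.fs.VolumeManager
  public interface VolumeManagerEcAdditions {
    // proposed: create the directory and apply the table's policies in one step
    boolean mkdirs(Path path, String storagePolicy, String encodingPolicy) throws IOException;

    // proposed helper: verify (and if needed apply) the policies on an existing
    // directory, e.g. after a property change or when a tablet directory is
    // created during a split
    void checkDirPolicies(Path path, String storagePolicy, String encodingPolicy)
        throws IOException;
  }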

I have this implemented in Accumulo 2.0, and can share patches if there is 
interest.  I'm a little worried that the approach I took is too HDFS-specific, 
so some thought would have to go into structuring the HDFS implementation so 
that it doesn't become a burden down the road should another filesystem 
implementation ever be added.

I welcome any thoughts, suggestions or not-too-barbed criticisms :)  Thanks for 
reading.

Ed
