[ 
https://issues.apache.org/jira/browse/HDFS-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341822#comment-14341822
 ] 

Todd Lipcon commented on HDFS-7845:
-----------------------------------

One quick hint here that may help: for a series of ints like block IDs, plain 
Snappy/LZ4 on its own isn't likely to make a big difference (the maximum run 
length is pretty much bounded by the size of the integer, because every block 
ID is different). However, you could likely get a very good improvement by 
doing something like the following:

1) On the DN, sort the blocks by ascending block ID before doing the block 
report. This only happens on the DN side, so it's easy to scale and doesn't 
consume NN CPU.
2) Shuffle the resulting array so that you have all of the MSBs, followed by 
all of the second most significant bits, etc. Essentially, you're converting to 
a columnar layout where each bit position within the ints is a column. This can 
be done very efficiently with SSE instructions with a bit of JNI (similar 
throughput to memcpy). The result is likely to have long runs of 1 or 0 bits if 
the input block IDs are clustered around certain sets of values.
3) Run the result through LZ4 or Snappy.

You could optionally insert a differential encoding step between (1) and (2) 
which would probably improve the compression ratio with little cost.
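
To make the steps above concrete, here is a rough sketch in plain Java. It is 
an illustration only, under several assumptions: block IDs are treated as 
64-bit longs, the class and method names (BlockReportShuffle, bitShuffle, 
etc.) are made up for this example, the transpose is a simple scalar loop 
rather than the SSE/JNI version, and java.util.zip.Deflater stands in for 
LZ4/Snappy because neither ships with the JDK.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical helper, not part of HDFS.
final class BlockReportShuffle {

    // Step (1): sort block IDs ascending (on a copy).
    static long[] sortedCopy(long[] ids) {
        long[] s = ids.clone();
        Arrays.sort(s);
        return s;
    }

    // Optional step: store each value as the difference from its predecessor.
    static long[] deltaEncode(long[] sorted) {
        long[] out = new long[sorted.length];
        long prev = 0;
        for (int i = 0; i < sorted.length; i++) {
            out[i] = sorted[i] - prev;
            prev = sorted[i];
        }
        return out;
    }

    static long[] deltaDecode(long[] deltas) {
        long[] out = new long[deltas.length];
        long prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            prev += deltas[i];
            out[i] = prev;
        }
        return out;
    }

    // Step (2): columnar layout. Plane 0 holds the MSB of every value,
    // plane 1 the next bit, and so on down to the LSB in plane 63.
    static byte[] bitShuffle(long[] values) {
        int planeBytes = (values.length + 7) / 8;
        byte[] out = new byte[64 * planeBytes];
        for (int plane = 0; plane < 64; plane++) {
            int bit = 63 - plane;
            int base = plane * planeBytes;
            for (int i = 0; i < values.length; i++) {
                if (((values[i] >>> bit) & 1L) != 0) {
                    out[base + (i >>> 3)] |= 0x80 >>> (i & 7);
                }
            }
        }
        return out;
    }

    // Inverse of bitShuffle; n is the number of values that were shuffled.
    static long[] bitUnshuffle(byte[] shuffled, int n) {
        int planeBytes = (n + 7) / 8;
        long[] out = new long[n];
        for (int plane = 0; plane < 64; plane++) {
            int bit = 63 - plane;
            int base = plane * planeBytes;
            for (int i = 0; i < n; i++) {
                if ((shuffled[base + (i >>> 3)] & (0x80 >>> (i & 7))) != 0) {
                    out[i] |= 1L << bit;
                }
            }
        }
        return out;
    }

    // Step (3) stand-in: size after DEFLATE at its fastest level.
    static int deflatedSize(byte[] data) {
        Deflater d = new Deflater(Deflater.BEST_SPEED);
        d.setInput(data);
        d.finish();
        byte[] buf = new byte[4096];
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(buf);
        }
        d.end();
        return total;
    }

    static byte[] toBytes(long[] values) {
        ByteBuffer buf = ByteBuffer.allocate(8 * values.length);
        buf.asLongBuffer().put(values);
        return buf.array();
    }

    public static void main(String[] args) {
        // Synthetic, clustered IDs; real block reports will differ.
        Random rnd = new Random(42);
        long[] ids = new long[100_000];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = 1073741824L + 3L * i + rnd.nextInt(3);
        }
        long[] sorted = sortedCopy(ids);
        byte[] shuffled = bitShuffle(deltaEncode(sorted));
        System.out.println("raw bytes:           " + 8 * ids.length);
        System.out.println("deflate only:        " + deflatedSize(toBytes(sorted)));
        System.out.println("shuffle + deflate:   " + deflatedSize(shuffled));
    }
}
```

The receiver would reverse the steps: decompress, bitUnshuffle, then 
deltaDecode. The sizes printed by main come from synthetic data and a 
different codec, so they only hint at what a real block report would show.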

I didn't come up with the bit-shuffling idea. You can read more about it at 
http://www.blosc.org/, which also has some benchmarks showing that it gets 
very good compression performance and adds almost no overhead relative to 
LZ4. It's also significantly faster than vint-encoding from a CPU standpoint 
(since vints tend to be branchy).

> Compress block reports
> ----------------------
>
>                 Key: HDFS-7845
>                 URL: https://issues.apache.org/jira/browse/HDFS-7845
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7836
>            Reporter: Colin Patrick McCabe
>            Assignee: Charles Lamb
>
> We should optionally compress block reports using a low-cpu codec such as lz4 
> or snappy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
