[
https://issues.apache.org/jira/browse/HDFS-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341822#comment-14341822
]
Todd Lipcon commented on HDFS-7845:
-----------------------------------
One quick hint here that may help: for a series of ints like block IDs, just
using plain Snappy/LZ4 isn't likely to make a big difference (the longest
repeated run is pretty much bounded by the size of one integer, because every
block ID is different). However, you could likely get a very good improvement
by doing something like the following:
1) On the DN, sort the blocks by ascending block ID before doing the block
report. This only happens on the DN side, so it's easy to scale and doesn't
consume NN CPU.
2) Shuffle the resulting array so that you have all of the MSBs, followed by
all of the second most significant bits, etc. Essentially, you're converting to
a columnar layout where each bit position within the ints is a column. This can
be done very efficiently with SSE instructions with a bit of JNI (similar
throughput to memcpy). The result is likely to have long runs of 1 or 0 bits if
the input block IDs are clustered around certain sets of values.
3) Run the result through LZ4 or Snappy.
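The three steps above can be sketched roughly as follows. This is only an
illustration under several assumptions: the block IDs are synthetic, the
function names (bit_transpose, bit_untranspose, pack_plain) are made up for
this sketch, zlib stands in for LZ4/Snappy because it ships with Python, and
the pure-Python loop shows the layout only, not the SSE/JNI implementation
suggested above.

```python
import random
import zlib


def bit_transpose(values, width=64):
    """Emit the MSB of every value, then the next bit of every value, etc."""
    out = bytearray((len(values) * width + 7) // 8)
    pos = 0  # index of the next output bit
    for bit in range(width - 1, -1, -1):
        for v in values:
            if (v >> bit) & 1:
                out[pos >> 3] |= 0x80 >> (pos & 7)
            pos += 1
    return bytes(out)


def bit_untranspose(data, count, width=64):
    """Inverse of bit_transpose for `count` values of `width` bits each."""
    values = [0] * count
    pos = 0
    for bit in range(width - 1, -1, -1):
        for i in range(count):
            if data[pos >> 3] & (0x80 >> (pos & 7)):
                values[i] |= 1 << bit
            pos += 1
    return values


def pack_plain(values):
    """Row layout for comparison: each value as 8 big-endian bytes."""
    return b"".join(v.to_bytes(8, "big") for v in values)


# Synthetic, clustered block IDs (an assumption made for this sketch).
rng = random.Random(0)
base = 1 << 30
block_ids = sorted(rng.sample(range(base, base + 1_000_000), 4096))  # step 1

shuffled = bit_transpose(block_ids)                      # step 2
compressed = zlib.compress(shuffled)                     # step 3 (zlib stand-in)
plain_compressed = zlib.compress(pack_plain(block_ids))  # no bit shuffle

print("raw bytes:           ", len(shuffled))
print("plain + zlib:        ", len(plain_compressed))
print("bit-shuffled + zlib: ", len(compressed))
```

The receiver would reverse the steps: decompress, then call bit_untranspose
with the block count, which has to be sent alongside the payload.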
You could optionally insert a differential encoding step between (1) and (2)
which would probably improve the compression ratio with little cost.
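A minimal sketch of that differential encoding step, assuming the IDs are
already sorted as in (1). The function names are hypothetical. Because the
gaps between neighbouring sorted IDs are small, most of the high bits of each
delta are zero, which is what would give the bit shuffle in (2) longer runs.

```python
def delta_encode(sorted_ids):
    """Replace each ID with its distance from the previous one (first from 0)."""
    deltas = []
    prev = 0
    for v in sorted_ids:
        deltas.append(v - prev)
        prev = v
    return deltas


def delta_decode(deltas):
    """Rebuild the original IDs with a running sum."""
    ids = []
    total = 0
    for d in deltas:
        total += d
        ids.append(total)
    return ids


ids = [1073741824, 1073741825, 1073741827, 1073741900]
deltas = delta_encode(ids)
print(deltas)  # [1073741824, 1, 2, 73]
```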
I didn't come up with the bit-shuffling idea - you can read more about it at
http://www.blosc.org/, which also has some benchmarks showing that it gets very
good compression performance and adds almost no overhead relative to LZ4. It's
also significantly faster than vint-encoding from a CPU standpoint (since vints
tend to be branchy).
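To illustrate the "branchy" point, here is a sketch of a generic LEB128-style
varint codec. It is an assumption for illustration only and is not necessarily
the exact vint format Hadoop uses. The number of loop iterations depends on the
magnitude of each value, so the encoder and decoder take a data-dependent
branch per byte, whereas the bit shuffle does the same fixed work for every
value.

```python
def encode_varint(value):
    """Encode a non-negative int, 7 bits per byte, low-order group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:                    # more groups follow: set continuation bit
            out.append(byte | 0x80)
        else:                        # last group: data-dependent exit
            out.append(byte)
            return bytes(out)


def decode_varint(data, offset=0):
    """Return (value, next_offset) for the varint starting at `offset`."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # continuation bit clear: done
            return result, offset
        shift += 7


print(encode_varint(300))         # b'\xac\x02'
print(decode_varint(b"\xac\x02"))  # (300, 2)
```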
> Compress block reports
> ----------------------
>
> Key: HDFS-7845
> URL: https://issues.apache.org/jira/browse/HDFS-7845
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Affects Versions: HDFS-7836
> Reporter: Colin Patrick McCabe
> Assignee: Charles Lamb
>
> We should optionally compress block reports using a low-cpu codec such as lz4
> or snappy.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)