[ https://issues.apache.org/jira/browse/HDFS-7845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341828#comment-14341828 ]

Todd Lipcon commented on HDFS-7845:
-----------------------------------

FWIW, I ran a quick test on a random DN I found in one of our test clusters. It 
doesn't have that many blocks (only ~2300 per disk) because it's not meant for 
scalability testing, but I figured it'd be better than nothing:

First I generated a binary file with the block IDs encoded in 8-byte packed 
form:
{code}
$ find . | grep -v meta | grep -o 'blk_[0-9]*' | sed -e 's,blk_,,' | sort -n | perl -ne 'print pack("Q", $_);' > /tmp/blocks.bin
{code}
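In case it helps anyone reproduce this without the perl one-liner, here's a rough 
Python equivalent of that pipeline (hypothetical helper, not anything from the HDFS 
code base; same assumptions about the on-disk layout, i.e. block files named 
blk_<id> with .meta files alongside):
{code}
# Walk the data dir, pull out the block IDs, and write them out sorted as
# native-endian 8-byte integers, just like the shell pipeline above.
import os, re, struct

ids = []
for root, dirs, files in os.walk("."):
    for name in files:
        m = re.match(r"blk_(\d+)$", name)   # the $ skips the .meta files
        if m:
            ids.append(int(m.group(1)))

with open("/tmp/blocks.bin", "wb") as out:
    for block_id in sorted(ids):
        out.write(struct.pack("Q", block_id))   # same layout as perl's pack("Q")
{code}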
Then I loaded it into ipython to try some different compression options:

{code}
In [16]: d = file("/tmp/blocks.bin").read()

In [17]: len(d)
Out[17]: 18520

In [18]: len(lz4.compress(d))
Out[18]: 9666

In [19]: len(blosc.compress(d, typesize=8))
Out[19]: 3838

In [20]: len(zlib.compress(d))
Out[20]: 4174

In [21]: len(lz4.compressHC(d))
Out[21]: 9510
{code}

i.e. on this particular workload, doing shuffle+lz4 is about 2.5x better than 
straight lz4, and actually beats zlib. We get about 5x compression vs. the raw 
ints. I'd be curious if someone could try this on a larger cluster (or by 
dumping the actual set of live block IDs from an fsimage).
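For anyone wondering what the shuffle is buying us: blosc's shuffle filter 
essentially regroups the bytes of the 8-byte values before compressing, so byte 0 
of every ID comes first, then byte 1, and so on. Since the sorted IDs mostly differ 
only in their low-order bytes, that leaves long near-constant runs for the codec to 
chew on. A rough sketch of the idea (not blosc's actual implementation; Python 2 
str slicing to match the session above):
{code}
def shuffle(d, typesize=8):
    # Byte-transpose the packed IDs: all the 1st bytes, then all the 2nd
    # bytes, etc. This is just the idea behind blosc's shuffle filter.
    return "".join(d[i::typesize] for i in range(typesize))

# e.g. compare len(zlib.compress(d)) against len(zlib.compress(shuffle(d)))
{code}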

BTW, I'd also guess that you'd see good gains from compressing the block sizes 
in the same manner.
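
If someone wants to check that guess, something along these lines should do it 
(same layout assumptions as above; treating the blk_<id> file size as the block 
length, kept in block-ID order like it would appear in a report):
{code}
# Hypothetical quick test: pack the per-block lengths the same way as the IDs
# and run them through the same codecs.
import os, re, struct, zlib, blosc

blocks = []
for root, dirs, files in os.walk("."):
    for name in files:
        m = re.match(r"blk_(\d+)$", name)        # block files only, no .meta
        if m:
            blocks.append((int(m.group(1)),
                           os.path.getsize(os.path.join(root, name))))

raw = b"".join(struct.pack("Q", size) for _id, size in sorted(blocks))
print(len(raw), len(zlib.compress(raw)), len(blosc.compress(raw, typesize=8)))
{code}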

> Compress block reports
> ----------------------
>
>                 Key: HDFS-7845
>                 URL: https://issues.apache.org/jira/browse/HDFS-7845
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7836
>            Reporter: Colin Patrick McCabe
>            Assignee: Charles Lamb
>
> We should optionally compress block reports using a low-cpu codec such as lz4 
> or snappy.


