[ https://issues.apache.org/jira/browse/HADOOP-7657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170285#comment-13170285 ]
Binglin Chang commented on HADOOP-7657:
---------------------------------------
I'm afraid it's not typical. Terasort input data has lots of repeated strings:
.t^#\|v$2\
0AAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDDEEEEEEEEEEFFFFFFFFFFGGGGGGGGGGHHHHHHHH
75@~?'WdUF
1IIIIIIIIIIJJJJJJJJJJKKKKKKKKKKLLLLLLLLLLMMMMMMMMMMNNNNNNNNNNOOOOOOOOOOPPPPPPPP
w[o||:N&H,
2QQQQQQQQQQRRRRRRRRRRSSSSSSSSSSTTTTTTTTTTUUUUUUUUUUVVVVVVVVVVWWWWWWWWWWXXXXXXXX
^Eu)<n#kdP
3YYYYYYYYYYZZZZZZZZZZAAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDDEEEEEEEEEEFFFFFFFF
Normally, log data, web page collections, and structured data (tables, record
sets), which are very common in Hadoop use cases, have higher compression
ratios (compressed size as a share of the original): about 20%-30% in my
experience. Plain text does not compress as well, at around 50%.
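To illustrate why run-heavy input like the Terasort sample is not a
representative benchmark, here is a minimal sketch. It uses zlib from the
Python standard library purely as a stand-in (it is neither Snappy nor LZ4),
and the row generator `terasort_like_rows` is a hypothetical helper that only
imitates the shape of the rows shown above, so the numbers it prints are
indicative at best.

```python
import os
import zlib


def compressed_percent(data: bytes) -> float:
    """Return the compressed size as a percentage of the original size."""
    return 100.0 * len(zlib.compress(data)) / len(data)


def terasort_like_rows(n_rows: int) -> bytes:
    """Build rows shaped like the sample above (an assumption, not the real
    generator): 10 random key bytes, then 8 runs of a repeated letter."""
    rows = []
    for i in range(n_rows):
        key = os.urandom(10)
        filler = b"".join(
            bytes([ord("A") + (i * 8 + j) % 26]) * 10 for j in range(8)
        )
        rows.append(key + filler + b"\n")
    return b"".join(rows)


if __name__ == "__main__":
    repetitive = terasort_like_rows(10_000)
    random_bytes = os.urandom(len(repetitive))
    print(f"terasort-like rows: {compressed_percent(repetitive):.1f} %")
    print(f"random bytes:       {compressed_percent(random_bytes):.1f} %")
```

Swapping in real log or table data for `repetitive` gives a more realistic
figure for a given workload.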
Just for reference, here are the test results from snappy unittest:
Benchmark     Time(ns)   CPU(ns)  Iterations
--------------------------------------------
BM_ZFlat/0      270628    270443         737  361.1MB/s  html (23.57 %)
BM_ZFlat/1     3215660   3214530         100  208.3MB/s  urls (50.89 %)
BM_ZFlat/2       43917     43870        4533    2.7GB/s  jpg (99.88 %)
BM_ZFlat/3      123527    123369        1593  729.2MB/s  pdf (82.13 %)
BM_ZFlat/4     1098281   1097779         181  355.8MB/s  html4 (23.55 %)
BM_ZFlat/5      117902    117862        1669  199.1MB/s  cp (48.12 %)
BM_ZFlat/6       49592     49591        3824  214.4MB/s  c (42.40 %)
BM_ZFlat/7       12214     12202       14461  290.8MB/s  lsp (48.37 %)
BM_ZFlat/8     2898230   2895690         100  339.1MB/s  xls (41.34 %)
BM_ZFlat/9     1053508   1052727         187  137.8MB/s  txt1 (59.81 %)
BM_ZFlat/10     894211    893702         222  133.6MB/s  txt2 (64.07 %)
BM_ZFlat/11    2811850   2810680         100  144.8MB/s  txt3 (57.11 %)
BM_ZFlat/12    3594620   3592880         100  127.9MB/s  txt4 (68.35 %)
BM_ZFlat/13     991489    990943         194  493.9MB/s  bin (18.21 %)
BM_ZFlat/14     186471    186407        1076  195.6MB/s  sum (51.88 %)
BM_ZFlat/15      17664     17648       10672  228.4MB/s  man (59.36 %)
BM_ZFlat/16     259190    259137         770  436.4MB/s  pb (23.15 %)
BM_ZFlat/17     897617    896724         225  196.0MB/s  gaviota (38.27 %)
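For anyone who wants to compare these figures against the ranges I quoted, a
small sketch that parses the lines above may help. It assumes the value in
parentheses is the compressed size as a percentage of the original (which
fits jpg being close to 100 %); the regular expression and the 30 % threshold
are my own illustrative choices, not part of the Snappy test.

```python
import re

# Benchmark lines copied verbatim from the Snappy unit test output above.
LINES = """\
BM_ZFlat/0 270628 270443 737 361.1MB/s html (23.57 %)
BM_ZFlat/1 3215660 3214530 100 208.3MB/s urls (50.89 %)
BM_ZFlat/2 43917 43870 4533 2.7GB/s jpg (99.88 %)
BM_ZFlat/3 123527 123369 1593 729.2MB/s pdf (82.13 %)
BM_ZFlat/4 1098281 1097779 181 355.8MB/s html4 (23.55 %)
BM_ZFlat/5 117902 117862 1669 199.1MB/s cp (48.12 %)
BM_ZFlat/6 49592 49591 3824 214.4MB/s c (42.40 %)
BM_ZFlat/7 12214 12202 14461 290.8MB/s lsp (48.37 %)
BM_ZFlat/8 2898230 2895690 100 339.1MB/s xls (41.34 %)
BM_ZFlat/9 1053508 1052727 187 137.8MB/s txt1 (59.81 %)
BM_ZFlat/10 894211 893702 222 133.6MB/s txt2 (64.07 %)
BM_ZFlat/11 2811850 2810680 100 144.8MB/s txt3 (57.11 %)
BM_ZFlat/12 3594620 3592880 100 127.9MB/s txt4 (68.35 %)
BM_ZFlat/13 991489 990943 194 493.9MB/s bin (18.21 %)
BM_ZFlat/14 186471 186407 1076 195.6MB/s sum (51.88 %)
BM_ZFlat/15 17664 17648 10672 228.4MB/s man (59.36 %)
BM_ZFlat/16 259190 259137 770 436.4MB/s pb (23.15 %)
BM_ZFlat/17 897617 896724 225 196.0MB/s gaviota (38.27 %)
""".splitlines()

# name, time, cpu, iterations, throughput, label, "(NN.NN %)"
PATTERN = re.compile(r"^\S+\s+\d+\s+\d+\s+\d+\s+\S+\s+(\w+)\s+\(([\d.]+) %\)$")


def parse(lines):
    """Map each input label to its reported percentage."""
    sizes = {}
    for line in lines:
        match = PATTERN.match(line.strip())
        if match:
            sizes[match.group(1)] = float(match.group(2))
    return sizes


sizes = parse(LINES)
# Inputs that shrink to 30 % of their original size or less.
compress_well = sorted(label for label, pct in sizes.items() if pct <= 30.0)
# Mean over the four plain-text inputs.
text_mean = sum(sizes[label] for label in ("txt1", "txt2", "txt3", "txt4")) / 4

if __name__ == "__main__":
    print("30 % or less:", ", ".join(compress_well))
    print(f"plain-text mean: {text_mean:.2f} %")
```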
There is also a test result at http://code.google.com/p/lz4/ that uses a more
standard corpus, with a compression ratio of around 50%.
> Add support for LZ4 compression
> -------------------------------
>
> Key: HADOOP-7657
> URL: https://issues.apache.org/jira/browse/HADOOP-7657
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Mr Bsd
> Labels: compression
>
> According to several benchmark sites, LZ4 seems to overtake other fast
> compression algorithms, especially in the decompression speed area. The
> interface is also trivial to integrate
> (http://code.google.com/p/lz4/source/browse/trunk/lz4.h) and there is no
> license issue.