[ 
https://issues.apache.org/jira/browse/HADOOP-7657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170285#comment-13170285
 ] 

Binglin Chang commented on HADOOP-7657:
---------------------------------------

I'm afraid it's not typical. Terasort input data has lots of repeated strings:
.t^#\|v$2\         
0AAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDDEEEEEEEEEEFFFFFFFFFFGGGGGGGGGGHHHHHHHH
75@~?'WdUF         
1IIIIIIIIIIJJJJJJJJJJKKKKKKKKKKLLLLLLLLLLMMMMMMMMMMNNNNNNNNNNOOOOOOOOOOPPPPPPPP
w[o||:N&H,         
2QQQQQQQQQQRRRRRRRRRRSSSSSSSSSSTTTTTTTTTTUUUUUUUUUUVVVVVVVVVVWWWWWWWWWWXXXXXXXX
^Eu)<n#kdP         
3YYYYYYYYYYZZZZZZZZZZAAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDDEEEEEEEEEEFFFFFFFF

Log data, web page collections and structured data (tables, record sets),
which are very common in Hadoop use cases, normally compress better: to about
20%-30% of the original size in my experience. Plain text is not as good, at
around 50%.
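
The difference is easy to check with a quick measurement. The sketch below is
only an illustration and not part of any benchmark in this issue: it uses
Python's standard zlib module as a stand-in for Snappy or LZ4, so the absolute
numbers will differ, and the helper name compressed_percent and both sample
inputs are made up for the example. It reports compressed size as a percentage
of the original size, so lower means better compression.

```python
import os
import zlib


def compressed_percent(data: bytes) -> float:
    """Return compressed size as a percentage of the original size."""
    if not data:
        raise ValueError("data must not be empty")
    return 100.0 * len(zlib.compress(data)) / len(data)


# Terasort-like filler: runs of ten identical letters, as in the rows above.
# (Assumption: this imitates only the filler, not the random keys.)
filler = "".join(chr(ord("A") + i % 26) * 10 for i in range(1000)).encode("ascii")

# Incompressible input of the same length, for comparison.
noise = os.urandom(len(filler))

print(f"terasort-like filler: {compressed_percent(filler):.2f} %")
print(f"random bytes:         {compressed_percent(noise):.2f} %")
```

Running the same helper over a sample of real job input (logs, records) would
give a more representative figure than either synthetic input.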

Just for reference, here are the test results from the Snappy unit test (the
percentage in parentheses is the compressed size relative to the original):
BM_ZFlat/0             270628     270443        737 361.1MB/s  html (23.57 %)
BM_ZFlat/1            3215660    3214530        100 208.3MB/s  urls (50.89 %)
BM_ZFlat/2              43917      43870       4533 2.7GB/s  jpg (99.88 %)
BM_ZFlat/3             123527     123369       1593 729.2MB/s  pdf (82.13 %)
BM_ZFlat/4            1098281    1097779        181 355.8MB/s  html4 (23.55 %)
BM_ZFlat/5             117902     117862       1669 199.1MB/s  cp (48.12 %)
BM_ZFlat/6              49592      49591       3824 214.4MB/s  c (42.40 %)
BM_ZFlat/7              12214      12202      14461 290.8MB/s  lsp (48.37 %)
BM_ZFlat/8            2898230    2895690        100 339.1MB/s  xls (41.34 %)
BM_ZFlat/9            1053508    1052727        187 137.8MB/s  txt1 (59.81 %)
BM_ZFlat/10            894211     893702        222 133.6MB/s  txt2 (64.07 %)
BM_ZFlat/11           2811850    2810680        100 144.8MB/s  txt3 (57.11 %)
BM_ZFlat/12           3594620    3592880        100 127.9MB/s  txt4 (68.35 %)
BM_ZFlat/13            991489     990943        194 493.9MB/s  bin (18.21 %)
BM_ZFlat/14            186471     186407       1076 195.6MB/s  sum (51.88 %)
BM_ZFlat/15             17664      17648      10672 228.4MB/s  man (59.36 %)
BM_ZFlat/16            259190     259137        770 436.4MB/s  pb (23.15 %)
BM_ZFlat/17            897617     896724        225 196.0MB/s  gaviota (38.27 %)
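
To compare the data types side by side, the lines can be parsed and sorted by
compressed size. This is a rough sketch under a few assumptions: the three
numeric columns are taken to be time in ns, CPU time in ns and iteration
count, 1 GB/s is treated as 1024 MB/s, and the names SAMPLE, LINE_RE and
parse_results are invented for the example. Only three of the lines above are
included to keep it short.

```python
import re

# Three lines copied from the benchmark output above.
SAMPLE = """\
BM_ZFlat/0             270628     270443        737 361.1MB/s  html (23.57 %)
BM_ZFlat/2              43917      43870       4533 2.7GB/s  jpg (99.88 %)
BM_ZFlat/9            1053508    1052727        187 137.8MB/s  txt1 (59.81 %)
"""

# Assumed column layout: name, time, CPU time, iterations, throughput,
# data label, compressed size in percent.
LINE_RE = re.compile(
    r"^(?P<name>\S+)\s+(?P<time_ns>\d+)\s+(?P<cpu_ns>\d+)\s+(?P<iters>\d+)\s+"
    r"(?P<rate>[\d.]+)(?P<unit>[MG]B/s)\s+(?P<label>\S+)\s+\((?P<pct>[\d.]+) %\)$"
)


def parse_results(text):
    """Return (label, throughput in MB/s, compressed size in %) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a result line
        # Assumption: 1 GB/s == 1024 MB/s.
        factor = 1024.0 if m["unit"] == "GB/s" else 1.0
        rows.append((m["label"], float(m["rate"]) * factor, float(m["pct"])))
    return rows


# Print the best-compressing data types first.
for label, mb_per_s, pct in sorted(parse_results(SAMPLE), key=lambda r: r[2]):
    print(f"{label:8s} {pct:6.2f} %  {mb_per_s:8.1f} MB/s")
```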

There is also a test result at http://code.google.com/p/lz4/, which uses a
more standard corpus with a compression ratio of around 50%.


                
> Add support for LZ4 compression
> -------------------------------
>
>                 Key: HADOOP-7657
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7657
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Mr Bsd
>              Labels: compression
>
> According to several benchmark sites, LZ4 seems to overtake other fast 
> compression algorithms, especially in the decompression speed area. The 
> interface is also trivial to integrate 
> (http://code.google.com/p/lz4/source/browse/trunk/lz4.h) and there is no 
> license issue.


        
