[ https://issues.apache.org/jira/browse/HADOOP-5793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982338#action_12982338 ]

Binglin Chang commented on HADOOP-5793:
---------------------------------------

Luke: I read the paper "Data compression using long common strings", which 
describes BMDiff. It seems that the main advantage of BMDiff is its ability to 
find long common strings across the entire file (not only within the sliding 
window of dictionary-based algorithms). However, Hadoop uses a streaming 
compression framework that sends one block (buffer) at a time to the 
compressor/decompressor, which prevents BMDiff from finding repeated strings 
across the entire file and may lead to poor compression results. Are there any 
test results showing the relationship between pack (buffer) size, compression 
speed, and compression ratio?
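
To make the concern concrete, here is a minimal sketch (not an actual BMDiff 
port) of how a codec is driven through Hadoop's 
org.apache.hadoop.io.compress.Compressor interface; compressOneBlock is a 
hypothetical helper I made up for illustration, and it assumes the output 
buffer is large enough for one compressed block:

    import java.io.IOException;
    import org.apache.hadoop.io.compress.Compressor;

    public class BlockScopedCompressionSketch {
        // Hypothetical helper: compresses a single block independently.
        // The compressor only sees the bytes passed via setInput(), so a
        // long string that repeats only in a *later* block cannot be
        // matched here.
        static int compressOneBlock(Compressor compressor, byte[] block,
                                    byte[] out) throws IOException {
            compressor.reset();                  // drop state from previous blocks
            compressor.setInput(block, 0, block.length);
            compressor.finish();                 // no more input for this block
            int total = 0;
            while (!compressor.finished()) {     // drain the compressed bytes
                total += compressor.compress(out, total, out.length - total);
            }
            return total;                        // compressed size of this block only
        }
    }

With the whole-file view that the paper assumes, BMDiff could reference a 
repeat from much earlier in the input; with per-block calls like the above, 
any cross-block repetition is invisible to it.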

> High speed compression algorithm like BMDiff
> --------------------------------------------
>
>                 Key: HADOOP-5793
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5793
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: elhoim gibor
>            Assignee: Michele Catasta
>            Priority: Minor
>
> Add a high speed compression algorithm like BMDiff.
> It gives speeds of ~100MB/s for writes and ~1000MB/s for reads, compressing 
> 2.1 billion web pages from 45.1TB down to 4.2TB.
> Reference:
> http://norfolk.cs.washington.edu/htbin-post/unrestricted/colloq/details.cgi?id=437
> 2005 Jeff Dean talk about Google architecture, around 46:00.
> http://feedblog.org/2008/10/12/google-bigtable-compression-zippy-and-bmdiff/
> http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=755678
> A reference implementation exists in HyperTable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
