[jira] Updated: (HBASE-2987) Avoid compressing flush files

Andrew Purtell (JIRA) Fri, 10 Sep 2010 17:04:14 -0700

     [ https://issues.apache.org/jira/browse/HBASE-2987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Andrew Purtell updated HBASE-2987:
----------------------------------

    Attachment: HBASE-2987.patch

> Avoid compressing flush files
> -----------------------------
>
>                 Key: HBASE-2987
>                 URL: https://issues.apache.org/jira/browse/HBASE-2987
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Minor
>         Attachments: HBASE-2987.patch
>
>
> I've extended Hadoop compression to use the LZMA algorithm and HFile to 
> provide an option for selecting it. With typical input, the LZMA algorithm 
> produces 30% smaller output than GZIP at max compression (which is currently 
> the best available option for HFiles) and 15% smaller output than BZIP2. I'm 
> aware of the "disk is cheap" mantra but for a multi-peta-scale archival 
> application, where we still want random read and random update capabilities, 
> 30% less disk is a substantial cost savings. LZMA compression speed is ~1 
> MB/second on a 2 GHz CPU, decompression speed is ~20 MB/second. This is 4x 
> slower than BZIP2 to compress but at least 2x faster to decompress for 15% 
> better results. For an archival application these properties would be 
> acceptable if not for the very significant problem of flushing. Obviously the 
> low throughput of the LZMA compressor means it is unsuitable for foreground 
> processing. In HBase terms, it can be used for compaction but not for flush 
> files. 
> The attached patch, against the 0.20 branch, turns off compression for 
> flushes. This could be implemented as a config option, but I wonder whether, 
> with the possible exception of LZO, we should be compressing flushes at all. 
> Any significant reduction in flush throughput can stall writers during 
> periods of high write activity. Maybe globally disabling compression on flush 
> files is a good thing? 
> I have tested this and confirmed the result is the desired behavior: 'file' 
> shows flush files as uncompressed data, compacted files as compressed. 
> Compaction merges files with different compression properties. LZMA provides 
> rather extreme space savings over the other available options without slowing 
> down writers if the regionservers are configured with enough write buffering 
> to ride over the significantly lengthened compaction times.
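
To make the trade-off above concrete, here is a minimal sketch of the policy the description argues for: flush files are written uncompressed, and compaction output uses the column family's configured codec. It is not the attached patch and does not use HBase APIs. The class, enum, and method names are hypothetical, and the 64 MB flush size is an assumed figure for illustration. Only the ~1 MB/second LZMA compression rate comes from the description.

```java
// Hypothetical sketch only: these names do not come from HBase or from
// HBASE-2987.patch. It models the policy described in the issue.
final class FlushCompressionSketch {

    // Stand-ins for the codecs discussed in the issue.
    enum Codec { NONE, LZMA }

    // The two situations in which a store file gets written.
    enum WriteKind { FLUSH, COMPACTION }

    // Flushes run in the foreground of the write path, so they skip
    // compression; compactions use whatever the family is configured with.
    static Codec codecFor(WriteKind kind, Codec familyCodec) {
        return kind == WriteKind.FLUSH ? Codec.NONE : familyCodec;
    }

    // Rough time to push a given amount of data through a compressor.
    static double secondsToCompress(double megabytes, double mbPerSecond) {
        if (mbPerSecond <= 0) {
            throw new IllegalArgumentException("throughput must be positive");
        }
        return megabytes / mbPerSecond;
    }

    public static void main(String[] args) {
        double assumedFlushMb = 64.0;  // assumption, not stated in the issue
        double lzmaMbPerSecond = 1.0;  // ~1 MB/second, from the description

        System.out.println("flush codec:      "
                + codecFor(WriteKind.FLUSH, Codec.LZMA));
        System.out.println("compaction codec: "
                + codecFor(WriteKind.COMPACTION, Codec.LZMA));
        System.out.println("seconds to LZMA-compress one flush: "
                + secondsToCompress(assumedFlushMb, lzmaMbPerSecond));
    }
}
```

Under these assumptions, compressing a single 64 MB flush with LZMA would take about 64 seconds, which is the kind of stall the description wants to keep out of the write path. Deferring the same cost to compaction moves it to the background.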

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
