[ 
https://issues.apache.org/jira/browse/ACCUMULO-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830861#comment-13830861
 ] 

Chris McCubbin commented on ACCUMULO-1787:
------------------------------------------

Sure, here's a patch that should work on the current 1.5.1-SNAPSHOT. It adds a 
couple of config parameters: table.file.large.compress.threshold and 
table.file.large.compress.type. Also, after talking with [~afuchs], I think 
passing the estimated size into the openWriter() method may be better 
accomplished by appending a value to the Accumulo configuration rather than 
altering the method signature, as I have done in this patch.
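
To illustrate the idea, here is a minimal sketch of how a writer might pick a 
codec from those two parameters. It is not the code in the patch. The class 
and method names (TwoTierCompression, chooseCodec) are hypothetical, the 
threshold is assumed to be a plain byte count, and a plain Map stands in for 
the Accumulo configuration. The fallback to table.file.compress.type assumes 
the existing per-table codec stays the default for small files.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Accumulo or of the attached patch.
class TwoTierCompression {
    // Property names as given in the comment above.
    static final String THRESHOLD_PROP = "table.file.large.compress.threshold";
    static final String LARGE_TYPE_PROP = "table.file.large.compress.type";
    // Assumed to remain the general per-table codec setting.
    static final String DEFAULT_TYPE_PROP = "table.file.compress.type";

    // Returns the codec name to use for a file of the given estimated size.
    // Assumption: the threshold value is a decimal number of bytes.
    static String chooseCodec(Map<String, String> tableConfig, long estimatedSizeBytes) {
        String defaultCodec = tableConfig.getOrDefault(DEFAULT_TYPE_PROP, "gz");
        String largeCodec = tableConfig.get(LARGE_TYPE_PROP);
        String threshold = tableConfig.get(THRESHOLD_PROP);
        if (largeCodec == null || threshold == null) {
            // Two-tier settings absent: behave like the single-codec setup.
            return defaultCodec;
        }
        long thresholdBytes = Long.parseLong(threshold.trim());
        // Files at or above the threshold are assumed to be long-lived.
        return estimatedSizeBytes >= thresholdBytes ? largeCodec : defaultCodec;
    }

    public static void main(String[] args) {
        Map<String, String> cfg = new HashMap<>();
        cfg.put(DEFAULT_TYPE_PROP, "snappy");       // fast codec for short-lived files
        cfg.put(LARGE_TYPE_PROP, "gz");             // smaller footprint for large files
        cfg.put(THRESHOLD_PROP, "104857600");       // 100 MiB, an arbitrary example value
        System.out.println(chooseCodec(cfg, 5L * 1024 * 1024));
        System.out.println(chooseCodec(cfg, 2L * 1024 * 1024 * 1024));
    }
}
```

Carrying the estimated size as a value in the configuration, as suggested 
above, would let a selection step like this run inside the writer without 
changing the openWriter() signature.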

> support two tier compression codec configuration
> ------------------------------------------------
>
>                 Key: ACCUMULO-1787
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1787
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Adam Fuchs
>         Attachments: ci_file_sizes.png
>
>
> Given our current configuration of one compression codec per table we have 
> the option of leaning towards performance with something like snappy or 
> leaning towards smaller footprint with something like gzip. With a change to 
> the way we configure codecs we might be able to approach the best of both 
> worlds. Consider the difference between files that have been written by major 
> or minor compactions and files that exist at any given point in time. For 
> better footprint on disk we care about the latter, but for total CPU usage 
> over time we care about the former. The two distributions are distinct 
> because Accumulo deletes files after major compactions. If we figure out 
> whether a file is going to be long-lived at the time we write it then we can 
> pick the compression codec that optimizes the relevant concern.
> One way to distinguish is by file size. Accumulo writes many small files and 
> later major compacts those away, so the distribution of written files is 
> skewed towards smaller files while the distribution of files existing at any 
> point in time is skewed towards larger files. I recommend for each table we 
> support a general compression codec and a second codec for files under a 
> configurable size.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
