Adam Fuchs created ACCUMULO-1787:
------------------------------------

             Summary: support two tier compression codec configuration
                 Key: ACCUMULO-1787
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1787
             Project: Accumulo
          Issue Type: Improvement
            Reporter: Adam Fuchs


Given our current configuration of one compression codec per table, we have the
option of leaning towards performance with something like snappy or leaning
towards a smaller footprint with something like gzip. With a change to the way
we configure codecs, we might be able to approach the best of both worlds.
Consider the difference between the files that have been written by major or
minor compactions over time and the files that exist at any given point in
time. For a better footprint on disk we care about the latter, but for total
CPU usage over time we care about the former. The two distributions are
distinct because Accumulo deletes files after major compactions. If we can
figure out, at the time we write a file, whether it is going to be long-lived,
then we can pick the compression codec that optimizes the relevant concern.

One way to distinguish is by file size. Accumulo writes many small files and
later major compacts those away, so the distribution of written files is skewed
towards smaller files, while the distribution of files existing at any point in
time is skewed towards larger files. I recommend that, for each table, we
support a general compression codec and a second codec for files under a
configurable size.
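
As a rough illustration of the proposal above, the selection logic might look
something like the sketch below. Everything in it is an assumption made for
illustration: the class TwoTierCodecChooser, its three settings, and the idea
that the writer has a size estimate for the output file before it starts
writing (for example, from the sizes of the compaction inputs). None of these
names are existing Accumulo properties or APIs.

```java
/**
 * Hypothetical sketch of a two-tier codec choice. Not Accumulo code.
 */
public final class TwoTierCodecChooser {
    // Hypothetical per-table settings (names invented for this sketch).
    private final String generalCodec;          // used for files expected to be long-lived
    private final String smallFileCodec;        // used for files below the size threshold
    private final long smallFileThresholdBytes; // the configurable size cutoff

    public TwoTierCodecChooser(String generalCodec, String smallFileCodec,
                               long smallFileThresholdBytes) {
        if (smallFileThresholdBytes < 0) {
            throw new IllegalArgumentException("threshold must be non-negative");
        }
        this.generalCodec = generalCodec;
        this.smallFileCodec = smallFileCodec;
        this.smallFileThresholdBytes = smallFileThresholdBytes;
    }

    /**
     * Picks a codec name from an estimate of the output file size.
     * Files under the threshold are assumed to be short-lived, so they get
     * the second (typically faster) codec; everything else gets the general one.
     */
    public String chooseCodec(long estimatedOutputBytes) {
        return estimatedOutputBytes < smallFileThresholdBytes
                ? smallFileCodec
                : generalCodec;
    }

    public static void main(String[] args) {
        // Example values only: gzip for large files, snappy under 64 MiB.
        TwoTierCodecChooser chooser =
                new TwoTierCodecChooser("gzip", "snappy", 64L * 1024 * 1024);
        System.out.println(chooser.chooseCodec(4L * 1024 * 1024));        // small flush
        System.out.println(chooser.chooseCodec(2L * 1024 * 1024 * 1024)); // large compaction
    }
}
```

With the example values, a 4 MiB file would get snappy and a 2 GiB file would
get gzip. How good the size estimate can be at write time is an open question
that this sketch does not address.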



--
This message was sent by Atlassian JIRA
(v6.1#6144)
