[ https://issues.apache.org/jira/browse/ACCUMULO-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Fuchs updated ACCUMULO-1787:
---------------------------------
Attachment: ci_file_sizes.png
This is the distribution of file sizes produced during an hour of continuous
ingest on a single server. The green line represents files written by
compactions, and the red line represents files remaining in HDFS at the end of
the test.
> support two tier compression codec configuration
> ------------------------------------------------
>
> Key: ACCUMULO-1787
> URL: https://issues.apache.org/jira/browse/ACCUMULO-1787
> Project: Accumulo
> Issue Type: Improvement
> Reporter: Adam Fuchs
> Attachments: ci_file_sizes.png
>
>
> Given our current configuration of one compression codec per table, we have
> the option of leaning towards performance with something like snappy or
> leaning towards a smaller footprint with something like gzip. With a change to
> the way we configure codecs we might be able to approach the best of both
> worlds. Consider the difference between files that have been written by major
> or minor compactions and files that exist at any given point in time. For a
> smaller footprint on disk we care about the latter, but for total CPU usage
> over time we care about the former. The two distributions are distinct
> because Accumulo deletes files after major compactions. If we can figure out,
> at the time we write a file, whether it is going to be long-lived, then we can
> pick the compression codec that optimizes for the relevant concern.
> One way to distinguish files is by size. Accumulo writes many small files and
> later major compacts those away, so the distribution of written files is
> skewed towards smaller files, while the distribution of files existing at any
> point in time is skewed towards larger files. I recommend that for each table
> we support a general compression codec and a second codec for files under a
> configurable size.
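As a rough illustration of the proposal above, here is a minimal sketch of the selection logic a writer could apply when it opens a new file: if the estimated output size falls under a configurable threshold, use a cheap codec such as snappy; otherwise fall back to the table's general codec such as gzip. The class, property values, and size-estimation inputs are hypothetical examples for discussion, not existing Accumulo APIs or configuration keys.

{code:java}
/**
 * Illustrative sketch only. Chooses between two compression codecs based on the
 * estimated size of the file about to be written. Names and defaults here are
 * hypothetical, not existing Accumulo configuration keys or APIs.
 */
public class TwoTierCodecChooser {

  private final String defaultCodec;      // e.g. "gz": smaller footprint for long-lived files
  private final String smallFileCodec;    // e.g. "snappy": cheaper CPU for short-lived files
  private final long smallFileThreshold;  // bytes; estimated sizes below this use smallFileCodec

  public TwoTierCodecChooser(String defaultCodec, String smallFileCodec, long smallFileThreshold) {
    this.defaultCodec = defaultCodec;
    this.smallFileCodec = smallFileCodec;
    this.smallFileThreshold = smallFileThreshold;
  }

  /**
   * @param estimatedOutputSize estimated size in bytes of the file about to be written,
   *        e.g. the in-memory map size for a minor compaction or the sum of the input
   *        file sizes for a major compaction
   */
  public String chooseCodec(long estimatedOutputSize) {
    return estimatedOutputSize < smallFileThreshold ? smallFileCodec : defaultCodec;
  }

  public static void main(String[] args) {
    TwoTierCodecChooser chooser = new TwoTierCodecChooser("gz", "snappy", 32L * 1024 * 1024);
    // A small minor-compaction file that will likely be compacted away soon: snappy
    System.out.println(chooser.chooseCodec(4L * 1024 * 1024));
    // A large major-compaction file that will likely remain in HDFS for a long time: gz
    System.out.println(chooser.chooseCodec(256L * 1024 * 1024));
  }
}
{code}

With this kind of split, most of the many small files visible in the attached distribution would take the cheap codec, while the large files that dominate the on-disk footprint would still get the stronger compression.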