[ https://issues.apache.org/jira/browse/ACCUMULO-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830835#comment-13830835 ]

Chris McCubbin commented on ACCUMULO-1787:
------------------------------------------

I've been testing this idea. I modified Accumulo to choose the compression 
codec for a compaction based on the sum of the incoming file sizes. Minor 
compactions report an incoming file size of 0 under this scheme. I am 
currently using a single threshold at which the compression switches from one 
type to the other.
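
For illustration, the switching logic amounts to something like the following 
minimal sketch; the class and constant names are placeholders, not the actual 
Accumulo code paths:

{code:java}
// Minimal sketch of the single-threshold scheme described above. Names are
// illustrative placeholders, not the actual Accumulo internals.
public class TwoTierCompressionSketch {

  // 200MB matches the switching threshold used in the tests below.
  private static final long SWITCH_THRESHOLD_BYTES = 200L * 1024 * 1024;

  /**
   * Pick a codec from the summed size of the compaction's input files.
   * Minor compactions report 0, so they always get the cheaper codec.
   */
  public static String chooseCodec(long estimatedInputBytes) {
    return estimatedInputBytes < SWITCH_THRESHOLD_BYTES ? "snappy" : "gz";
  }
}
{code}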

For these tests I am running a 6-machine "Large" AMI cluster with 1 master and 
5 tablet servers and the 2GB config setting, on top of Hadoop 1.2.1. For the 
results below I am running continuous ingest, with one ingester per tablet 
server. The ingesters in this test are each set to ingest 250 million key-value 
pairs, for a total of 1.25 billion pairs. The test takes on the order of 1-2 
hours to complete (i.e. ingest is nontrivial and involves many major 
compactions). So far I have tested no compression, snappy, gzip, and a hybrid 
snappy/gzip scheme. From some previous tests, a compression switching threshold 
of 200MB seems to be a good starting point, so that is what I used for this 
test. Results:

|| Scheme || Total time to ingest (s) || Total time to ingest + finish compactions (s) || Total disk usage (GB) ||
| No compression | 5719 | 5939 | 128.5 | 
| Snappy | 5900 | 6149 | 75.2 |
| gzip | 8988 | 9831 | 48.8 |
| hybrid snappy-gzip | 6582 | 6845 | 51.7 |

So, according to these preliminary results (and others I have run as well), 
this idea seems to have merit: the disk usage of the hybrid scheme is only 
about 6% larger than that of the all-gzip scheme, while ingest plus compactions 
complete in roughly 70% of the time (and not much slower than the snappy or 
no-compression schemes).

The implementation is very simple: a single value, the estimated size of the 
new file, needs to be communicated from the Compactor to the openWriter() 
method in RFileOperations. The Compactor can compute this value at very little 
cost because the sizes of the files to be compacted have already been computed 
by the time the new file is opened.
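
A rough sketch of the Compactor-side estimate, using the standard Hadoop 
FileStatus class; the method shown is illustrative only, not the actual 
Accumulo code:

{code:java}
import java.util.List;

import org.apache.hadoop.fs.FileStatus;

// Illustrative sketch only: the estimate handed to openWriter() is just the
// sum of the input file lengths, which are already known to the compaction.
public class CompactionSizeEstimateSketch {

  /**
   * Sum the on-disk lengths of the files about to be compacted. A minor
   * compaction has no input files on disk and therefore reports 0.
   */
  public static long estimateOutputSize(List<FileStatus> filesToCompact) {
    long total = 0;
    for (FileStatus status : filesToCompact) {
      total += status.getLen();
    }
    return total;
  }
}
{code}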

> support two tier compression codec configuration
> ------------------------------------------------
>
>                 Key: ACCUMULO-1787
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1787
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Adam Fuchs
>         Attachments: ci_file_sizes.png
>
>
> Given our current configuration of one compression codec per table, we have 
> the option of leaning towards performance with something like snappy or 
> leaning towards a smaller footprint with something like gzip. With a change to 
> the way we configure codecs, we might be able to approach the best of both 
> worlds. Consider the difference between files that have been written by major 
> or minor compactions and files that exist at any given point in time. For a 
> better footprint on disk we care about the latter, but for total CPU usage 
> over time we care about the former. The two distributions are distinct 
> because Accumulo deletes files after major compactions. If we can figure out 
> at write time whether a file is going to be long-lived, then we can pick the 
> compression codec that optimizes the relevant concern.
> One way to distinguish is by file size. Accumulo writes many small files and 
> later major compacts them away, so the distribution of written files is 
> skewed towards smaller files, while the distribution of files existing at any 
> point in time is skewed towards larger files. I recommend that for each table 
> we support a general compression codec and a second codec for files under a 
> configurable size.
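
As a purely illustrative reading of the proposal quoted above, the per-table 
settings might look something like the following; only table.file.compress.type 
is an existing property, and the other two names are hypothetical:

{code}
# Hypothetical sketch of the proposed two-tier per-table configuration.
table.file.compress.type=gz
table.file.compress.type.small=snappy
table.file.compress.small.threshold=200M
{code}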



--
This message was sent by Atlassian JIRA
(v6.1#6144)
