[
https://issues.apache.org/jira/browse/ACCUMULO-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Miller updated ACCUMULO-1266:
-------------------------------------
Assignee: (was: Michael Miller)
> Automatically determine when a full major compaction would benefit scans
> ------------------------------------------------------------------------
>
> Key: ACCUMULO-1266
> URL: https://issues.apache.org/jira/browse/ACCUMULO-1266
> Project: Accumulo
> Issue Type: New Feature
> Reporter: Keith Turner
>
> For the following situation, there is a tipping point where it becomes
> beneficial to do a full major compaction.
> * a tablet is frequently scanned
> * scan time iterators suppress a lot of data
> * a full major compaction would also suppress that data
> Examples of this are tablets with lots of deletes, versions that are
> suppressed, data that's combined, and data that's filtered.
> If tablet servers kept track of statistics about scans, could this be used to
> determine when it's beneficial to automatically compact? In the following
> simple example, it seems obvious that a major compaction would be beneficial.
> In this example scans over the last hour have had to examine and throw away
> 20 million unneeded keys. A lot of scan work could have been saved by doing a
> major compaction.
> * all scans over tabletA within the last hour have read 30 million keys and
> returned 10 million keys
> * tabletA has 3 million keys
> * a major compaction would reduce tabletA to 1 million keys and result in
> future scans returning all keys read.
> One complicating factor is that major compactions may have a different set of
> iterators configured. Therefore it's possible that scans may filter a lot of
> data while major compactions do not. One option is to keep track of the ratio
> of data dropped by compactions and the ratio of data dropped by scans. This
> could be used when deciding if a major compaction should be done to improve
> scan performance.
> What other situations could cause unnecessary major compactions and need to be
> defended against?
> In the case where a compaction of just the data in memory would benefit
> scans, ACCUMULO-519 may solve the problem that this ticket is looking to
> solve.
> So what should the formula be?
> {code:java}
> // k/v : key values
> // recentlyRead : total number of k/v read by recent scans before applying
> //   iterators (recentlyRead - recentlyDropped equals # of k/v returned to
> //   users)
> // majcDropRatio : ratio of k/v dropped by recent major compactions
> // totalKeyValues : total # of k/v in tablet
> // R : a user configurable ratio, like the current major compaction ratio
> //   that is based on files
> if (recentlyRead * majcDropRatio > R * totalKeyValues) {
>   doFullMajorCompaction();
>   resetScanStats();
> }
> {code}
> The example formula above has an issue: it may initiate a major compaction
> when scans are not reading a part of the tablet that drops data. The formula
> below tries to remedy this.
> {code:java}
> // k/v : key values
> // recentlyDropped : number of k/v dropped by recent scans
> // recentlyRead : total number of k/v read by recent scans before applying
> //   iterators (recentlyRead - recentlyDropped equals # of k/v returned to
> //   users)
> // majcDropRatio : ratio of k/v dropped by recent major compactions
> // totalKeyValues : total # of k/v in tablet
> // R : a user configurable ratio, like the current major compaction ratio
> //   that is based on files
> if ((recentlyDropped > R * totalKeyValues)
>     && (recentlyRead * majcDropRatio > R * totalKeyValues)) {
>   doFullMajorCompaction();
>   resetScanStats();
> }
> {code}
> An issue with the above is that the total # of key values for a tablet may
> not be accurate because of bulk import and splits.
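
As a rough check, the second formula can be run against the tabletA numbers from the description. This is only a sketch: the class and method names are hypothetical and not part of Accumulo, R = 3.0 is picked purely for illustration, and majcDropRatio = 2/3 is derived by assuming past major compactions dropped data at the same rate the example compaction would (3 million keys reduced to 1 million).

```java
// Hypothetical sketch of the second formula in the ticket. None of these
// names exist in Accumulo; they mirror the variables used in the description.
final class ScanCompactionSketch {

    /**
     * Returns true when recent scans dropped more than R * totalKeyValues k/v
     * AND recentlyRead * majcDropRatio also exceeds R * totalKeyValues.
     */
    static boolean shouldCompact(long recentlyRead, long recentlyDropped,
                                 double majcDropRatio, long totalKeyValues,
                                 double r) {
        double threshold = r * totalKeyValues;
        boolean scansDropEnough = recentlyDropped > threshold;
        boolean compactionWouldHelp = recentlyRead * majcDropRatio > threshold;
        return scansDropEnough && compactionWouldHelp;
    }

    public static void main(String[] args) {
        // Numbers taken from the tabletA example in the description.
        long recentlyRead = 30_000_000L;
        long returned = 10_000_000L;
        long recentlyDropped = recentlyRead - returned; // 20 million
        long totalKeyValues = 3_000_000L;

        // Assumption: compactions drop 2 of every 3 k/v (3M reduced to 1M).
        double majcDropRatio = 2.0 / 3.0;
        // Assumption: illustrative value for the user configurable ratio R.
        double r = 3.0;

        System.out.println(shouldCompact(recentlyRead, recentlyDropped,
                majcDropRatio, totalKeyValues, r));
    }
}
```

With these assumed inputs the threshold is 9 million, and both 20 million dropped k/v and 30 million * 2/3 exceed it, so the check passes. If scans read the same 30 million k/v but dropped none, the first condition fails and no compaction is triggered, which is the case the second formula was meant to guard against.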
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)