[ https://issues.apache.org/jira/browse/ACCUMULO-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Miller updated ACCUMULO-1266: ------------------------------------- Assignee: (was: Michael Miller) > Automatically determine when a full major compaction would benefit scans > ------------------------------------------------------------------------ > > Key: ACCUMULO-1266 > URL: https://issues.apache.org/jira/browse/ACCUMULO-1266 > Project: Accumulo > Issue Type: New Feature > Reporter: Keith Turner > > For the following situation, there is a tipping point where it becomes > beneficial to do a full major compaction. > * a tablet is frequently scanned > * scan time iterators supress a lot of data > * a full major compaction would also supress that data > Examples of this are tablets with lots of deletes, versions that are > suppressed, data thats combined, and data thats filtered. > If tablet servers kept track of statistics about scans, could this be used to > determine when its beneficial to automatically compact? In the following > simple example, it seems obvious that a major compaction would be beneficial. > In this example scans over the last hour have had to examine and throw away > 20 million uneeded keys. Alot of scan work could have been saved by doing a > major compaction. > * all scans over tabletA within the last hour have read 30 million keys and > returned 10 million keys > * TabletA has 3 million keys > * a major compaction would reduce tabletA to 1 million keys and result in > future scans returning all keys read. > One complicating factor is that major compaction may have a different set of > iterators configured. Therefore its possible that scan may filter a lot of > data, and major compactions may not. Could possibly keep track of ratio of > data dropped by compactions and the ratio of data dropped by scans. This > could be used when deciding if a major compaction should be done to improve > scan performance. > What other situation can cause unnecessary major compactions and need to be > defended against? > In the case where a compaction of just the data in memory would benefit > scans, ACCUMULO-519 may solve the problem that this ticket is looking to > solve. > So what should the formula be? > {code:java} > // k/v : key values > // recentlyRead : total number of k/v read before applying iterators by > recent scans (recentlyRead - recentlyDropped equals # of k/v returned to > users) > // majcDropRatio : ratio of k/v dropped by recent major compactions > // totalKeyValues : total # of k/v in tablet > // R a user configurable ratio, like the current major compaction ratio > that is based on files > if((recentlyRead * majcDropRatio > R * totalKeyValues)){ > doFullMajorCompaction() > resetScanStats() > } > {code} > The example formula above has an issue, it may initiate a major compaction > when scans are not reading a part of the tablet that drops data. The formula > below tries to remedy this. > {code:java} > // k/v : key values > // recentlyDropped : number of k/v dropped by recent scans > // recentlyRead : total number of k/v read before applying iterators by > recent scans (recentlyRead - recentlyDropped equals # of k/v returned to > users) > // majcDropRatio : ratio of k/v dropped by recent major compactions > // totalKeyValues : total # of k/v in tablet > // R a user configurable ratio, like the current major compaction ratio > that is based on files > if((recentlyDropped > R * totalKeyValues) && (recentlyRead * majcDropRatio > > R * totalKeyValues)){ > doFullMajorCompaction() > resetScanStats() > } > {code} > An issue with the above is that the total # of key values for a tablet may > not be accurate because of bulk import and splits. -- This message was sent by Atlassian JIRA (v6.3.4#6332)