[ https://issues.apache.org/jira/browse/ACCUMULO-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Miller updated ACCUMULO-1266:
-------------------------------------
    Assignee:     (was: Michael Miller)

> Automatically determine when a full major compaction would benefit scans
> ------------------------------------------------------------------------
>
>                 Key: ACCUMULO-1266
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1266
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>
> For the following situation, there is a tipping point where it becomes 
> beneficial to do a full major compaction.
>  * a tablet is frequently scanned
>  * scan-time iterators suppress a lot of data
>  * a full major compaction would also suppress that data
> Examples of this are tablets with lots of deletes, versions that are 
> suppressed, data that's combined, and data that's filtered.
> If tablet servers kept track of statistics about scans, could this be used to 
> determine when it's beneficial to automatically compact?  In the following 
> simple example, it seems obvious that a major compaction would be beneficial. 
> In this example, scans over the last hour have had to examine and throw away 
> 20 million unneeded keys.  A lot of scan work could have been saved by doing a 
> major compaction.
>  * all scans over tabletA within the last hour have read 30 million keys and 
> returned 10 million keys 
>  * TabletA has 3 million keys
>  * a major compaction would reduce tabletA to 1 million keys and result in 
> future scans returning all keys read.
> One complicating factor is that major compactions may have a different set of 
> iterators configured.  Therefore it's possible that scans may filter a lot of 
> data, and major compactions may not.  Tablet servers could keep track of the 
> ratio of data dropped by compactions and the ratio of data dropped by scans.  
> This could be used when deciding if a major compaction should be done to 
> improve scan performance.
> What other situations could cause unnecessary major compactions and need to 
> be defended against?
> In the case where a compaction of just the data in memory would benefit 
> scans, ACCUMULO-519 may solve the problem that this ticket is looking to 
> solve.
> So what should the formula be?  
> {code:java}
>   // k/v             : key/value pairs
>   // recentlyRead    : total number of k/v read by recent scans before
>   //                   applying iterators (recentlyRead - recentlyDropped
>   //                   equals the number of k/v returned to users)
>   // majcDropRatio   : ratio of k/v dropped by recent major compactions
>   // totalKeyValues  : total number of k/v in the tablet
>   // R               : a user-configurable ratio, like the current
>   //                   file-based major compaction ratio
>   if (recentlyRead * majcDropRatio > R * totalKeyValues) {
>     doFullMajorCompaction();
>     resetScanStats();
>   }
> {code}
> The example formula above has an issue: it may initiate a major compaction 
> when scans are not reading a part of the tablet that drops data.  The formula 
> below tries to remedy this.
> {code:java}
>   // k/v             : key/value pairs
>   // recentlyDropped : number of k/v dropped by recent scans
>   // recentlyRead    : total number of k/v read by recent scans before
>   //                   applying iterators (recentlyRead - recentlyDropped
>   //                   equals the number of k/v returned to users)
>   // majcDropRatio   : ratio of k/v dropped by recent major compactions
>   // totalKeyValues  : total number of k/v in the tablet
>   // R               : a user-configurable ratio, like the current
>   //                   file-based major compaction ratio
>   if (recentlyDropped > R * totalKeyValues
>       && recentlyRead * majcDropRatio > R * totalKeyValues) {
>     doFullMajorCompaction();
>     resetScanStats();
>   }
> {code}
> An issue with the above is that the total # of key values for a tablet may 
> not be accurate because of bulk import and splits.
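
For illustration only, here is a minimal, self-contained sketch of the second
formula from the description, applied to the tabletA numbers. The class name
FullCompactionHeuristic, the method shouldCompact, and the value R = 3.0 are
assumptions made for this sketch; none of them come from the Accumulo code
base, and the issue does not propose a value for R. The majcDropRatio of 2/3
assumes recent major compactions dropped data at the same rate the example
predicts (3 million keys reduced to 1 million).

```java
/** Hypothetical sketch of the second formula in the description; not Accumulo code. */
final class FullCompactionHeuristic {

  private FullCompactionHeuristic() {}

  /**
   * Returns true when both conditions of the second formula hold.
   *
   * @param recentlyDropped number of k/v dropped by recent scans
   * @param recentlyRead    number of k/v read by recent scans before iterators ran
   * @param majcDropRatio   fraction of k/v dropped by recent major compactions
   * @param totalKeyValues  estimated number of k/v in the tablet
   * @param ratio           the user-configurable ratio R (assumed value in main)
   */
  static boolean shouldCompact(long recentlyDropped, long recentlyRead,
      double majcDropRatio, long totalKeyValues, double ratio) {
    double threshold = ratio * totalKeyValues;
    // Scans must actually be dropping data in the ranges they read.
    boolean scansDropEnough = recentlyDropped > threshold;
    // A major compaction must be expected to remove a comparable amount.
    boolean compactionWouldHelp = recentlyRead * majcDropRatio > threshold;
    return scansDropEnough && compactionWouldHelp;
  }

  public static void main(String[] args) {
    // tabletA numbers from the description.
    long recentlyRead = 30_000_000L;
    long returned = 10_000_000L;
    long recentlyDropped = recentlyRead - returned; // 20 million
    long totalKeyValues = 3_000_000L;
    double majcDropRatio = 2.0 / 3.0; // assumed: 3 million keys compact to 1 million
    double ratio = 3.0;               // assumed R; the issue gives no value

    System.out.println(shouldCompact(recentlyDropped, recentlyRead,
        majcDropRatio, totalKeyValues, ratio));

    // Scans that drop nothing should not trigger a compaction, even though
    // the first formula (which ignores recentlyDropped) would fire here.
    System.out.println(shouldCompact(0L, recentlyRead,
        majcDropRatio, totalKeyValues, ratio));
  }
}
```

With these assumed inputs the threshold R * totalKeyValues is 9 million, so
both conditions hold for tabletA, while the second call shows the case the
second formula is meant to guard against. Because totalKeyValues may be
inaccurate after bulk imports and splits, a real implementation would likely
need to treat it as an estimate.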



