[
https://issues.apache.org/jira/browse/HBASE-7842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13578894#comment-13578894
]
Sergey Shelukhin commented on HBASE-7842:
-----------------------------------------
Yeah, that was my thought above: look at all groups without taking the ratio
into account, find the best compaction, and then judge whether its goodness is
higher than some cutoff.
The metric for "best" and for the cutoff can vary...
Sum of file size - of the compacting files or of all files? I assume the
compacting files.
bq. We're trying to use size as a way to group files that are similar. If
there's a use case that has traffic come in waves we want to group the smaller
files up to create larger files before compacting the larger files.
There's no reason why finding similar files should depend on ordering (i.e.
looking at the sum of the next N file sizes in sequence).
So with sizes 10 7 3 4, all files satisfy the criteria (I am assuming it's not
applied to the last file). But in 7 10 3 4, only the last two files do.
> Add compaction policy that explores more storefile groups
> ---------------------------------------------------------
>
> Key: HBASE-7842
> URL: https://issues.apache.org/jira/browse/HBASE-7842
> Project: HBase
> Issue Type: New Feature
> Components: Compaction
> Affects Versions: 0.96.0
> Reporter: Elliott Clark
> Assignee: Elliott Clark
>
> Some workloads that are not as stable can have compactions that are too large
> or too small using the current storefile selection algorithm.
> Currently:
> * Find the first file fi such that FileSize(fi) <= Sum(0, i-1, FileSize(fx))
> * Ensure that there are the min number of files (if there aren't then bail
> out)
> * If there are too many files keep the larger ones.
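The current selection steps above can be sketched roughly as follows (a hedged reading, not the actual HBase implementation: the first qualifying file and everything after it is selected, selections under the minimum bail out, and the smallest trailing files are dropped when over the maximum):

```java
import java.util.Arrays;

public class CurrentSelection {
    // Sketch of the "Currently:" steps: find the first file whose size is
    // at most the sum of the files before it, select it and every later
    // file, bail out under minFiles, keep the larger (earlier) files when
    // over maxFiles. Illustrative only.
    static long[] select(long[] sizes, int minFiles, int maxFiles) {
        long prefix = 0;
        int start = -1;
        for (int i = 0; i < sizes.length; i++) {
            if (i > 0 && sizes[i] <= prefix) { start = i; break; }
            prefix += sizes[i];
        }
        if (start < 0 || sizes.length - start < minFiles) {
            return new long[0]; // bail out: not enough qualifying files
        }
        // Too many files: keep the larger ones at the front of the range.
        int end = Math.min(sizes.length, start + maxFiles);
        return Arrays.copyOfRange(sizes, start, end);
    }

    public static void main(String[] args) {
        // Selection starts at 30 (30 <= 100) and maxFiles=4 drops the
        // smallest file. Prints [30, 25, 20, 10].
        System.out.println(Arrays.toString(
            select(new long[]{100, 30, 25, 20, 10, 5}, 3, 4)));
    }
}
```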
> I would propose something like:
> * Find all sets of storefiles where every file satisfies
> ** FileSize(fi) <= Sum(0, i-1, FileSize(fx))
> ** Num files in set <= max
> ** Num files in set >= min
> * Then pick the set of files that maximizes ((# storefiles in set) /
> Sum(FileSize(fx)))
> The thinking is that the above algorithm is pretty easy to reason about: all
> files satisfy the ratio, and it should rewrite the least amount of data to
> get the biggest impact on seeks.
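The proposed policy can be sketched as a window search. This is only an illustration under stated assumptions: candidate sets are taken as contiguous windows of the sequence-ordered file list, and the ratio criterion is applied within the window with the first window file exempt; neither detail is spelled out in the proposal.

```java
public class ExploringSketch {
    // Enumerate contiguous windows of minFiles..maxFiles files in which
    // every file after the first is no larger than the sum of the files
    // before it in the window, and pick the window maximizing
    // fileCount / totalSize. Returns inclusive {start, end} bounds,
    // or an empty array if no window qualifies. Illustrative only.
    static int[] bestWindow(long[] sizes, int minFiles, int maxFiles) {
        int[] best = null;
        double bestGoodness = -1;
        for (int i = 0; i < sizes.length; i++) {
            long sum = sizes[i];
            for (int j = i + 1; j < sizes.length; j++) {
                if (sizes[j] > sum) break;   // ratio criterion violated
                sum += sizes[j];
                int count = j - i + 1;
                if (count > maxFiles) break;
                if (count < minFiles) continue;
                double goodness = (double) count / sum;
                if (goodness > bestGoodness) {
                    bestGoodness = goodness;
                    best = new int[]{i, j};
                }
            }
        }
        return best == null ? new int[0] : best;
    }

    public static void main(String[] args) {
        // Maximizing count/size favors grouping the small files: here the
        // window of the two smallest files wins. Prints 4..5.
        int[] w = bestWindow(new long[]{100, 30, 25, 20, 10, 5}, 2, 4);
        System.out.println(w[0] + ".." + w[1]);
    }
}
```

The count/size metric matches the rationale above: it rewrites the least data per storefile removed.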