[ https://issues.apache.org/jira/browse/HBASE-7842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13578203#comment-13578203 ]
Elliott Clark commented on HBASE-7842:
--------------------------------------
bq.what I don't understand is the first new condition for every file. E.g.
(assume ratio 1) in order 10 7 4 5 is good to compact but 7 10 4 5 is not
We're trying to use size as a way to group files that are similar. If a use
case has traffic coming in waves, we want to compact the smaller files into
larger ones before compacting the files that are already large.
For something like:
[100 100 100 50 50 50 50 50 100 100]
I'd rather have the 50's chosen, and then compact the 100's later, since the
50's are more similar in size and compacting them first avoids re-writing the
same data over and over again.
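To make the grouping condition concrete, here's a rough sketch (illustrative
only, not the actual HBase code; I'm inferring the condition from the
10 7 4 5 example: each file must be no larger than ratio times the sum of the
files after it in selection order):
{code:java}
// Sketch of the per-file ratio check (illustrative, not the HBase source).
// An ordering is good to compact when every file, except the last, is no
// larger than ratio * (sum of the files after it). At ratio 1, 10 7 4 5
// passes, but 7 10 4 5 fails because 10 > 4 + 5.
public class RatioCheck {
  static boolean isGoodToCompact(long[] sizes, double ratio) {
    long sumAfter = 0;
    for (int i = sizes.length - 1; i >= 0; i--) {
      if (i < sizes.length - 1 && sizes[i] > ratio * sumAfter) {
        return false;
      }
      sumAfter += sizes[i];
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isGoodToCompact(new long[] {10, 7, 4, 5}, 1.0));  // true
    System.out.println(isGoodToCompact(new long[] {7, 10, 4, 5}, 1.0));  // false
  }
}
{code}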
You are correct; the main point I want to dive into is that we are not
looking at many of the possible groupings right now, and a deep search looks
cheap and could find better compactions.
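Here's a rough sketch of what that deep search could look like. It mirrors
the selection criteria proposed in the issue description below, with two
assumptions of mine: candidates are contiguous runs of the age-sorted file
list, and ties in the score are broken toward compacting more files.
{code:java}
// Illustrative deep search over candidate sets (a sketch of the idea, not
// actual HBase code). Every file in a run must pass the per-file ratio
// check, the run size must be within [minFiles, maxFiles], and the winner
// maximizes storefiles removed per byte rewritten.
import java.util.Arrays;

public class ExploreSketch {
  // Same per-file ratio check as the earlier sketch.
  static boolean satisfiesRatio(long[] sizes, double ratio) {
    long sumAfter = 0;
    for (int i = sizes.length - 1; i >= 0; i--) {
      if (i < sizes.length - 1 && sizes[i] > ratio * sumAfter) return false;
      sumAfter += sizes[i];
    }
    return true;
  }

  static long[] bestSet(long[] sizes, double ratio, int minFiles, int maxFiles) {
    long[] best = null;
    double bestScore = -1.0;
    for (int start = 0; start < sizes.length; start++) {
      int maxEnd = Math.min(sizes.length, start + maxFiles);
      for (int end = start + minFiles; end <= maxEnd; end++) {
        long[] candidate = Arrays.copyOfRange(sizes, start, end);
        if (!satisfiesRatio(candidate, ratio)) continue;
        long total = Arrays.stream(candidate).sum();
        // Score: storefiles eliminated per byte rewritten; on a tie,
        // prefer the candidate that compacts more files.
        double score = (double) candidate.length / total;
        if (score > bestScore
            || (score == bestScore && best != null && candidate.length > best.length)) {
          bestScore = score;
          best = candidate;
        }
      }
    }
    return best;
  }

  public static void main(String[] args) {
    long[] store = {100, 100, 100, 50, 50, 50, 50, 50, 100, 100};
    // Picks the run of 50s: the most files per byte rewritten.
    System.out.println(Arrays.toString(bestSet(store, 1.0, 2, 5)));
  }
}
{code}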
One of the things Stack and I thought about was not using the ratio for
grouping files at all, and instead using it to decide whether a compaction we
found is good enough.
Something like:
ratio = 1.2
files to compact = 5
sum of file sizes = 80
average store file size (across all files in the store) = 20
(80 / 20) / 5 = 0.8, which is < 1.2, so yes we compact.
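As code, that acceptance test is just (names are illustrative, not from
HBase):
{code:java}
// Sketch of the alternative: the ratio no longer groups files, it only
// decides whether a selection we already found is good enough to run.
public class GoodEnoughCheck {
  static boolean isGoodEnough(long selectionSize, int selectionCount,
                              double avgStoreFileSize, double ratio) {
    return (selectionSize / avgStoreFileSize) / selectionCount < ratio;
  }

  public static void main(String[] args) {
    // Worked example from above: (80 / 20) / 5 = 0.8 < 1.2 -> compact.
    System.out.println(isGoodEnough(80, 5, 20.0, 1.2));  // true
  }
}
{code}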
That idea is interesting and probably something that I'll try. However, I
wanted to start with a tweak of the algorithm that we have now and then
branch out.
bq.Back of the napkin calculation tells me that dumb exploration of ALL ordered
permutations should be fast
Yep. The runtime of choosing files shouldn't be a major concern as long as
it's not spiraling out of control.
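As a rough sanity check on that (my arithmetic, with my assumption that
candidates are restricted to contiguous runs of the age-sorted list): N
storefiles give at most N(N+1)/2 candidate sets, so even N = 100 means only
5,050 ratio checks per selection.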
> Add compaction policy that explores more storefile groups
> ---------------------------------------------------------
>
> Key: HBASE-7842
> URL: https://issues.apache.org/jira/browse/HBASE-7842
> Project: HBase
> Issue Type: New Feature
> Components: Compaction
> Affects Versions: 0.96.0
> Reporter: Elliott Clark
> Assignee: Elliott Clark
>
> Some workloads that are less stable can end up with compactions that are
> too large or too small under the current storefile selection algorithm.
> Currently:
> * Find the first file such that Size(fi) <= Sum(0, i-1, FileSize(fx))
> * Ensure that there is at least the minimum number of files (if not, bail
> out)
> * If there are too many files, keep the larger ones.
> I would propose something like:
> * Find all sets of storefiles where every file satisfies
> ** FileSize(fi) <= Sum(0, i-1, FileSize(fx))
> ** Num files in set <= max
> ** Num files in set >= min
> * Then pick the set of files that maximizes ((# storefiles in set) /
> Sum(FileSize(fx)))
> The thinking is that the above algorithm is pretty easy to reason about:
> all files satisfy the ratio, and it should rewrite the least amount of data
> to get the biggest impact in seeks.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira