[
https://issues.apache.org/jira/browse/HBASE-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217410#comment-13217410
]
Matt Corgan commented on HBASE-5479:
------------------------------------
{quote}you need to do a bulk import MR (vs Put-based) or you have your
compaction algorithm tuned incorrectly... you probably want to switch your
compaction ratio to 0.125 and play with it from there{quote}
Yeah, just using it as an opportunity to push HBase with real data to see what
breaks first. I hesitate to change the global compaction ratio when it's just
a couple out of ~20 tables.
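(For context, a minimal sketch of why that change is global, assuming the
usual hbase.hstore.compaction.ratio key; there's no per-CF knob for it today:)
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Sketch only: lowering the ratio this way affects every table on the
// cluster, not just the couple of hot ones.
public class RatioSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    conf.setFloat("hbase.hstore.compaction.ratio", 0.125f); // global, not per-CF
  }
}
{code}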
Agree pluggable compaction strategies would be great, as would many other
per-CF settings. Making them pluggable would be far more useful than
perfecting a general algorithm.
Is there a quick fix that could deal with outdated requests? Like ignoring a
CompactionRequest if the files in its CompactionSelection are no longer all
present. Or, when pulling a CompactionRequest from the head of the queue,
iterating the entire queue to check whether there's a newer CompactionRequest
for the same Store.
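To make the first idea concrete, here's a minimal sketch of the staleness
check (stand-in types only; the real Store/CompactionRequest accessors may
differ):
{code:java}
import java.util.Collection;
import java.util.HashSet;

// Sketch only: the Collection<String> arguments stand in for the file lists
// held by the real CompactionRequest and Store.
public class StaleRequestCheck {

  /** True if any file selected at request time has since disappeared,
   *  e.g. was already merged away by an earlier compaction. */
  public static boolean isStale(Collection<String> selectedFiles,
                                Collection<String> currentStoreFiles) {
    return !new HashSet<String>(currentStoreFiles).containsAll(selectedFiles);
  }
}
{code}
A stale request could then be dropped outright, or sent back through selection
against the Store's current files.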
> Postpone CompactionSelection to compaction execution time
> ---------------------------------------------------------
>
> Key: HBASE-5479
> URL: https://issues.apache.org/jira/browse/HBASE-5479
> Project: HBase
> Issue Type: New Feature
> Components: io, performance, regionserver
> Reporter: Matt Corgan
>
> It can be commonplace for regionservers to develop long compaction queues,
> meaning a CompactionRequest may execute hours after it was created. The
> CompactionRequest holds a CompactionSelection that was selected at request
> time but may no longer be the optimal selection. The CompactionSelection
> should be created at compaction execution time rather than compaction request
> time.
> The current mechanism breaks down during high volume insertion. The
> inefficiency is clearest when the inserts are finished. Inserting for 5
> hours may build up 50 storefiles and a 40 element compaction queue. When
> finished inserting, you would prefer that the next compaction merges all 50
> files (or some large subset), but the current system will churn through each
> of the 40 compaction requests, the first of which may be hours old. This
> ends up re-compacting the same data many times.
> The current system is especially inefficient when dealing with time series
> data where the data in the storefiles has minimal overlap. With time series
> data, there is even less benefit to intermediate merges because most
> storefiles can be eliminated based on their key range during a read, even
> without bloomfilters. The only goal should be to reduce file count, not to
> minimize the number of files merged for each read.
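> To make the key-range elimination concrete, a minimal sketch (illustration
> only, not the actual read path; FileRange is a made-up stand-in type):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> // A store file whose [firstKey, lastKey] range cannot contain the requested
> // row is skipped entirely; no bloom filter needed.
> class KeyRangeFilterSketch {
>   static class FileRange {
>     final String firstKey, lastKey;
>     FileRange(String firstKey, String lastKey) {
>       this.firstKey = firstKey;
>       this.lastKey = lastKey;
>     }
>   }
>
>   static List<FileRange> filesToRead(List<FileRange> storeFiles, String row) {
>     List<FileRange> candidates = new ArrayList<FileRange>();
>     for (FileRange f : storeFiles) {
>       if (f.firstKey.compareTo(row) <= 0 && f.lastKey.compareTo(row) >= 0) {
>         candidates.add(f); // for time-series data this is usually few files
>       }
>     }
>     return candidates;
>   }
> }
> {code}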
> There are other aspects to the current queuing mechanism that would need to
> be looked at. You would want to avoid having the same Store in the queue
> multiple times. And you would want the completion of one compaction to
> possibly queue another compaction request for the store.
> An alternative architecture to the current style of queues would be to have
> each Store (all open in memory) keep a compactionPriority score up to date
> after events like flushes, compactions, schema changes, etc. Then you create
> a "CompactionPriorityComparator implements Comparator<Store>" and stick all
> the Stores into a PriorityQueue (synchronized remove/add from the queue when
> the value changes). The async compaction threads would keep pulling off the
> head of that queue as long as the head has compactionPriority > X.
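> A minimal sketch of that design (Store below is a stand-in with just the one
> field the idea needs, and X is an arbitrary example threshold):
> {code:java}
> import java.util.Comparator;
> import java.util.PriorityQueue;
>
> // Stand-in for the real Store; only the priority score matters here.
> class Store {
>   volatile double compactionPriority; // updated on flush/compaction/etc.
> }
>
> class CompactionPriorityComparator implements Comparator<Store> {
>   public int compare(Store a, Store b) {
>     // Highest priority at the head of the queue.
>     return Double.compare(b.compactionPriority, a.compactionPriority);
>   }
> }
>
> class CompactionScheduler {
>   private static final double X = 1.0; // example "worth compacting" threshold
>   private final PriorityQueue<Store> queue =
>       new PriorityQueue<Store>(11, new CompactionPriorityComparator());
>
>   // Re-position a Store after an event changes its score; remove() is a
>   // no-op the first time, so this also handles initial insertion and keeps
>   // each Store in the queue at most once.
>   synchronized void onPriorityChanged(Store s, double newPriority) {
>     queue.remove(s);
>     s.compactionPriority = newPriority;
>     queue.add(s);
>   }
>
>   // Called by each async compaction thread in a loop.
>   synchronized Store pollIfWorthCompacting() {
>     Store head = queue.peek();
>     if (head != null && head.compactionPriority > X) {
>       return queue.poll(); // selection happens now, at execution time
>     }
>     return null; // nothing urgent; the thread can sleep and retry
>   }
> }
> {code}
> When a compaction finishes, the worker would recompute the Store's score and
> call onPriorityChanged(), which naturally covers re-queueing the same store
> for a follow-up compaction.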