[
https://issues.apache.org/jira/browse/HBASE-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217456#comment-13217456
]
Nicolas Spiegelberg commented on HBASE-5479:
--------------------------------------------
@Matt: Also see HBASE-5335, which will allow you to change the compaction ratio
on a per-cf basis for multi-flow clusters. I am currently working on that
JIRA, so I suggest you watch it.
Regarding outdated requests: having outdated requests should indicate an HBase
bug, not intended design. A compaction request should lock all the StoreFiles
in question, those StoreFiles should only be removed by the compaction itself,
and compaction requests should be disjoint. Any break of this contract is a bug
:P. Did this arise because of splitting?
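To spell out the contract, a rough sketch (illustrative names only, not the
actual HBase classes):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class StoreSketch {
      static class StoreFile {}

      private final List<StoreFile> storeFiles = new ArrayList<>();
      private final Set<StoreFile> filesCompacting = new HashSet<>();

      // A request claims a disjoint set of files; files already claimed by
      // an outstanding request are never handed out twice.
      synchronized List<StoreFile> claimFilesForCompaction() {
        List<StoreFile> claimed = new ArrayList<>();
        for (StoreFile f : storeFiles) {
          if (!filesCompacting.contains(f)) {
            claimed.add(f);
          }
        }
        filesCompacting.addAll(claimed);  // locked until the compaction completes
        return claimed;
      }

      // Only the compaction that claimed these files removes them, so a
      // request can never find its files deleted out from under it.
      synchronized void commitCompaction(List<StoreFile> claimed, StoreFile result) {
        storeFiles.removeAll(claimed);
        filesCompacting.removeAll(claimed);
        storeFiles.add(result);
      }
    }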
> Postpone CompactionSelection to compaction execution time
> ---------------------------------------------------------
>
> Key: HBASE-5479
> URL: https://issues.apache.org/jira/browse/HBASE-5479
> Project: HBase
> Issue Type: New Feature
> Components: io, performance, regionserver
> Reporter: Matt Corgan
>
> It can be commonplace for regionservers to develop long compaction queues,
> meaning a CompactionRequest may execute hours after it was created. The
> CompactionRequest holds a CompactionSelection that was chosen at request
> time but may no longer be the optimal selection by execution time. The
> CompactionSelection should therefore be created at compaction execution
> time rather than at compaction request time.
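> For illustration, the change might look something like this (a hypothetical
> sketch, not the current code):
>
>     import java.util.List;
>
>     class DeferredCompactionRequest implements Runnable {
>       // Minimal stand-in for the real Store; only what the sketch needs.
>       interface Store {
>         List<String> selectFilesToCompact();  // runs the selection algorithm
>         void compact(List<String> files);
>       }
>
>       private final Store store;
>
>       DeferredCompactionRequest(Store store) { this.store = store; }
>
>       @Override
>       public void run() {
>         // The selection is made here, at execution time, so it sees every
>         // file present now, including ones flushed after we were enqueued.
>         List<String> files = store.selectFilesToCompact();
>         if (!files.isEmpty()) {
>           store.compact(files);
>         }
>       }
>     }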
> The current mechanism breaks down during high-volume insertion, and the
> inefficiency is clearest once the inserts finish. Inserting for 5 hours
> may build up 50 storefiles and a 40-element compaction queue. When
> finished inserting, you would prefer that the next compaction merge all 50
> files (or some large subset), but the current system will churn through each
> of the 40 compaction requests, the first of which may be hours old. This
> ends up re-compacting the same data many times.
> The current system is especially inefficient when dealing with time-series
> data, where the data in the storefiles has minimal overlap. With time-series
> data there is even less benefit to intermediate merges, because most
> storefiles can be eliminated based on their key range during a read, even
> without bloomfilters. The only goal should be to reduce the total file
> count, not to minimize the number of files each read must merge.
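> As a sketch, that elimination is just a range check per file, so read cost
> stays low even with many files on disk (illustrative names):
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     class KeyRangeElimination {
>       // [firstKey, lastKey] covered by one storefile.
>       static class FileRange {
>         final String firstKey, lastKey;
>         FileRange(String firstKey, String lastKey) {
>           this.firstKey = firstKey;
>           this.lastKey = lastKey;
>         }
>       }
>
>       // A get for `row` only opens files whose range could contain it; with
>       // time-series data the ranges barely overlap, so most files drop out.
>       static List<FileRange> filesToRead(List<FileRange> files, String row) {
>         List<FileRange> hits = new ArrayList<>();
>         for (FileRange f : files) {
>           if (f.firstKey.compareTo(row) <= 0 && row.compareTo(f.lastKey) <= 0) {
>             hits.add(f);
>           }
>         }
>         return hits;
>       }
>     }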
> There are other aspects of the current queuing mechanism that would need to
> be looked at. You would want to avoid having the same Store in the queue
> multiple times, and you would want the completion of one compaction to be
> able to queue another compaction request for the same store.
> An alternative architecture to the current style of queues would be to have
> each Store (all open in memory) keep a compactionPriority score up to date
> after events like flushes, compactions, schema changes, etc. Then you create
> a "CompactionPriorityComparator implements Comparator<Store>" and stick all
> the Stores into a PriorityQueue (synchronized remove/add from the queue when
> the value changes). The async compaction threads would keep pulling off the
> head of that queue as long as the head has compactionPriority > X.
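> Sketched out (hypothetical names), this also handles both concerns above:
> the remove/add keeps each Store in the queue at most once, and a finished
> compaction simply re-queues its Store:
>
>     import java.util.Comparator;
>     import java.util.concurrent.PriorityBlockingQueue;
>
>     class CompactionScheduler {
>       static class Store {
>         volatile double compactionPriority;  // refreshed after flushes, compactions, schema changes
>       }
>
>       // Highest compactionPriority at the head of the queue.
>       static class CompactionPriorityComparator implements Comparator<Store> {
>         @Override
>         public int compare(Store a, Store b) {
>           return Double.compare(b.compactionPriority, a.compactionPriority);
>         }
>       }
>
>       private final PriorityBlockingQueue<Store> queue =
>           new PriorityBlockingQueue<>(11, new CompactionPriorityComparator());
>       private final double threshold;  // the "X" below which we leave a Store alone
>
>       CompactionScheduler(double threshold) { this.threshold = threshold; }
>
>       // Called whenever a Store's priority changes: remove/add so the queue
>       // re-sorts it, and so each Store appears at most once.
>       void onPriorityChanged(Store s) {
>         queue.remove(s);
>         queue.offer(s);
>       }
>
>       // Body of each async compaction thread.
>       void workerLoop() throws InterruptedException {
>         while (true) {
>           Store head = queue.take();                 // blocks until a Store is queued
>           if (head.compactionPriority > threshold) {
>             compact(head);
>             onPriorityChanged(head);                 // completion may queue more work
>           } else {
>             queue.offer(head);                       // head not worth compacting yet;
>             Thread.sleep(1000);                      // back off instead of spinning
>           }
>         }
>       }
>
>       private void compact(Store s) {
>         // Run the merge, then recompute s.compactionPriority from the new
>         // file set (not shown in this sketch).
>       }
>     }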