[ 
https://issues.apache.org/jira/browse/HBASE-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217550#comment-13217550
 ] 

Matt Corgan commented on HBASE-5479:
------------------------------------

Re: outdated requests: I now see that in Store.requestCompaction you are 
eliminating already-queued files from consideration, so requested files will 
never have disappeared between the time a compaction is requested and the time 
it is executed.

Let me take another stab at explaining the problem.  Say you have 
hbase.hstore.compactionThreshold=3 and hbase.hstore.compaction.max=20.  You are 
flushing a particular memstore every minute, and compactions are backed up by an 
hour for whatever reason.  After 3 minutes of inserting, the CompactSplitThread 
creates a CompactionRequest for the first 3 StoreFiles.  During the next hour, 
while that first CompactionRequest sits in the queue, 60 new StoreFiles are 
added and 20 additional CompactionRequests are queued.

Finally, the first CompactionRequest makes it to the head of the queue and is 
ready to be executed.  At this point, there are 63 small StoreFiles in the 
Store.  While this original CompactionRequest was correct at the time it was 
created, I would now prefer that it compact the first 20 files (the 
hbase.hstore.compaction.max limit), not just the first 3.

Maybe it could abort a CompactionRequest if there are already items in 
Store.filesCompacting.
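
The scenario above can be sketched in a few lines of Java.  This is only an 
illustration of choosing files at execution time, not HBase code: SketchStore, 
flush, and selectAtExecutionTime are hypothetical names, and the selection rule 
(oldest files first, up to the max) is an assumed simplification of the real 
selection logic.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a Store; not the HBase class. */
class SketchStore {
    final int compactionThreshold; // cf. hbase.hstore.compactionThreshold
    final int compactionMax;       // cf. hbase.hstore.compaction.max
    final List<String> storeFiles = new ArrayList<>();
    final Set<String> filesCompacting = new LinkedHashSet<>();

    SketchStore(int compactionThreshold, int compactionMax) {
        this.compactionThreshold = compactionThreshold;
        this.compactionMax = compactionMax;
    }

    /** Each flush adds one new file to the store. */
    void flush(String fileName) {
        storeFiles.add(fileName);
    }

    /**
     * Selects files when the compaction is about to run, so the choice
     * reflects every file present now rather than at request time.
     * Files already claimed by another compaction are skipped.
     */
    List<String> selectAtExecutionTime() {
        List<String> candidates = new ArrayList<>();
        for (String f : storeFiles) {
            if (!filesCompacting.contains(f)) {
                candidates.add(f);
            }
        }
        if (candidates.size() < compactionThreshold) {
            return new ArrayList<>(); // not enough files to bother
        }
        int count = Math.min(compactionMax, candidates.size());
        List<String> selected = new ArrayList<>(candidates.subList(0, count));
        filesCompacting.addAll(selected); // claim them until the compaction ends
        return selected;
    }
}
```

With threshold 3 and max 20, a store holding 63 flushed files would have its 
first 20 files selected when the request finally executes, instead of the 3 
that were present when the request was created.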
                
> Postpone CompactionSelection to compaction execution time
> ---------------------------------------------------------
>
>                 Key: HBASE-5479
>                 URL: https://issues.apache.org/jira/browse/HBASE-5479
>             Project: HBase
>          Issue Type: New Feature
>          Components: io, performance, regionserver
>            Reporter: Matt Corgan
>
> It can be commonplace for regionservers to develop long compaction queues, 
> meaning a CompactionRequest may execute hours after it was created.  The 
> CompactionRequest holds a CompactionSelection that was selected at request 
> time but may no longer be the optimal selection.  The CompactionSelection 
> should be created at compaction execution time rather than compaction request 
> time.
>
> The current mechanism breaks down during high volume insertion.  The 
> inefficiency is clearest when the inserts are finished.  Inserting for 5 
> hours may build up 50 storefiles and a 40 element compaction queue.  When 
> finished inserting, you would prefer that the next compaction merges all 50 
> files (or some large subset), but the current system will churn through each 
> of the 40 compaction requests, the first of which may be hours old.  This 
> ends up re-compacting the same data many times.  
>
> The current system is especially inefficient when dealing with time series 
> data where the data in the storefiles has minimal overlap.  With time series 
> data, there is even less benefit to intermediate merges because most 
> storefiles can be eliminated based on their key range during a read, even 
> without bloomfilters.  The only goal should be to reduce file count, not to 
> minimize the number of files merged for each read.
>
> There are other aspects to the current queuing mechanism that would need to 
> be looked at.  You would want to avoid having the same Store in the queue 
> multiple times.  And you would want the completion of one compaction to 
> possibly queue another compaction request for the store.
>
> An alternative architecture to the current style of queues would be to have 
> each Store (all open in memory) keep a compactionPriority score up to date 
> after events like flushes, compactions, schema changes, etc.  Then you create 
> a "CompactionPriorityComparator implements Comparator<Store>" and stick all 
> the Stores into a PriorityQueue (synchronized remove/add from the queue when 
> the value changes).  The async compaction threads would keep pulling off the 
> head of that queue as long as the head has compactionPriority > X.


        
