keith-turner opened a new issue #564: Add multiple compaction thread pools and 
allow multiple compactions per tablet
URL: https://github.com/apache/accumulo/issues/564
 
 
   Currently there is a single thread pool/executor for compactions, and only a 
single compaction can run per tablet.  This can cause problems when a user 
initiates a single long-running filter or transform compaction, because new 
files build up and are not compacted.  Ideally, a long-running compaction for a 
tablet could run in executor1 while the tablet's new files are compacted in 
executor2.
   
   The current user-pluggable CompactionStrategy class is not well suited to 
handling multiple executors and multiple compactions per tablet.  The 
following design is better suited to managing this concurrency in a way that 
is easy to understand.  In this design the CompactionManager and 
CompactionPrioritizer are user pluggable.  Currently, the prioritization of 
queued compactions is not configurable.
   
   | Functional component  | Description |
   |-----------------------|-------------|
   | CompactionJob         | Immutable class that describes work to be done.  Contains the list of files to compact, info about iterators for user compactions, and info about the output file (like compression type). |
   | CompactionManager     | Per-table class that decides what compactions to do for a tablet.  Can create and cancel compaction jobs.  Can see the list of existing jobs.  Can submit multiple jobs for a tablet as long as their files are disjoint.  This class decides which executor should process a job. |
   | CompactionPrioritizer | Per-executor class that decides which compaction job to execute next. |
   | CompactionExecutor    | Each tablet server has one or more executors that process compaction jobs.  These are configured system wide.  Number of threads, rate limits, and max files per compaction are some of the things that can be configured.  If a job exceeds the max files, the executor will process it in multiple passes. |
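
   To make the table more concrete, the components could be sketched in Java 
roughly as below.  This is only an illustration: every signature, the 
CompactionPlan type, and the use of plain strings for files and executors are 
assumptions, and iterator and output file settings are left out for brevity.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Immutable description of one unit of compaction work (hypothetical shape). */
final class CompactionJob {
  private final String id;
  private final Set<String> files;
  private final String executor;

  CompactionJob(String id, Collection<String> files, String executor) {
    this.id = id;
    // Defensive copy so later changes to the caller's collection cannot leak in.
    this.files = Collections.unmodifiableSet(new LinkedHashSet<>(files));
    this.executor = executor;
  }

  String id() { return id; }
  Set<String> files() { return files; }
  String executor() { return executor; }

  /** True when this job shares no input file with the other job. */
  boolean isDisjointFrom(CompactionJob other) {
    return Collections.disjoint(files, other.files);
  }
}

/** What a manager returns: jobs to cancel and new jobs to run (hypothetical). */
final class CompactionPlan {
  final List<CompactionJob> toCancel;
  final List<CompactionJob> toSubmit;

  CompactionPlan(List<CompactionJob> toCancel, List<CompactionJob> toSubmit) {
    this.toCancel = Collections.unmodifiableList(new ArrayList<>(toCancel));
    this.toSubmit = Collections.unmodifiableList(new ArrayList<>(toSubmit));
  }
}

/** Per-table plugin: decides which compactions a tablet should run and where. */
interface CompactionManager {
  CompactionPlan plan(Set<String> tabletFiles, List<CompactionJob> existingJobs);
}

/** Per-executor plugin: orders queued jobs; the "smallest" job runs first. */
interface CompactionPrioritizer extends Comparator<CompactionJob> {}
```

   The disjointness check is what would let the system accept several jobs for 
one tablet at the same time.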
   
   One major goal of this design is to make it easy for the user to write 
code that avoids concurrency mayhem.  The underlying idea is that a 
compaction manager will be called in the following way.
   
    * System gathers a snapshot of tablet files and current compaction jobs.
    * System calls the compaction manager with the gathered snapshots.
    * The compaction manager returns jobs to cancel and new jobs to run.
    * If the set of files and/or jobs has changed in the meantime, the 
decisions are ignored and the manager is called again.
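
   The calling sequence above is essentially an optimistic retry loop.  A 
minimal Java sketch follows; TabletSnapshot, PlanningLoop, planUntilStable, 
and the maxAttempts limit are all hypothetical names introduced here.  A real 
implementation would also need to make the final "unchanged?" check and the 
application of the decisions atomic, which this sketch does not attempt.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Objects;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Supplier;

/** Immutable view of a tablet's files and compaction job ids at one instant. */
final class TabletSnapshot {
  final Set<String> files;
  final Set<String> jobIds;

  TabletSnapshot(Set<String> files, Set<String> jobIds) {
    this.files = Collections.unmodifiableSet(new HashSet<>(files));
    this.jobIds = Collections.unmodifiableSet(new HashSet<>(jobIds));
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof TabletSnapshot)) {
      return false;
    }
    TabletSnapshot s = (TabletSnapshot) o;
    return files.equals(s.files) && jobIds.equals(s.jobIds);
  }

  @Override
  public int hashCode() {
    return Objects.hash(files, jobIds);
  }
}

final class PlanningLoop {
  /**
   * Calls the manager with a snapshot and accepts its (non-null) decision only
   * if a fresh snapshot taken afterwards is equal to the one it was given.
   * Returns empty if no stable decision was reached within maxAttempts.
   */
  static <P> Optional<P> planUntilStable(Supplier<TabletSnapshot> snapshots,
      Function<TabletSnapshot, P> manager, int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      TabletSnapshot before = snapshots.get();
      P decision = manager.apply(before);
      TabletSnapshot after = snapshots.get();
      if (before.equals(after)) {
        return Optional.of(decision);
      }
      // Files or jobs changed while the manager was deciding: discard and retry.
    }
    return Optional.empty();
  }
}
```

   Because the manager only ever sees an immutable snapshot, user code has no 
shared mutable state to guard.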
   
   With this model the prioritizer deals with immutable jobs that will not 
magically change when it's time to run the job (as can happen with the current 
compaction strategy).  This makes reasoning about creating, canceling, and 
prioritizing jobs sane.
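
   Since jobs are immutable, a prioritizer could be little more than a 
comparator over queued jobs.  The sketch below assumes a hypothetical 
QueuedJob type and an arbitrary policy (most input files first, smaller total 
size breaking ties); neither is part of the proposal.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class PrioritizerSketch {
  /** Hypothetical immutable summary of a queued job. */
  static final class QueuedJob {
    final String id;
    final int fileCount;
    final long totalBytes;

    QueuedJob(String id, int fileCount, long totalBytes) {
      this.id = id;
      this.fileCount = fileCount;
      this.totalBytes = totalBytes;
    }
  }

  // Illustrative policy only: more input files first, then smaller total size.
  static final Comparator<QueuedJob> MOST_FILES_FIRST =
      Comparator.comparingInt((QueuedJob j) -> j.fileCount).reversed()
          .thenComparingLong(j -> j.totalBytes);

  /** Queue an executor could poll to pick the next job to run. */
  static PriorityQueue<QueuedJob> newQueue() {
    return new PriorityQueue<>(MOST_FILES_FIRST);
  }
}
```

   The ordering stays valid for as long as a job sits in the queue, because 
none of the fields it depends on can change.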
   
   The following is an example of how this might work.  In this example assume 
executor E1 is intended for small compactions and executor E2 is for large 
compactions. Small vs large could be a function of the input file sizes.
   
    * Tablet T1 has three files: F1, F2, and F3.
    * Compaction manager decides to compact F2 and F3 on executor E1 as job J1.
    * A new file F4 is added to T1.
    * J1 is still queued on E1.
    * Compaction manager decides to cancel J1 and compact F1, F2, F3, and F4 on 
executor E2 as J2.
    * Nothing changed while the manager was deciding, so J1 is canceled and J2 
is submitted.
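
   The walk-through above can be reproduced with a small Java sketch of a 
manager.  Everything here is an assumption made for illustration: the file 
sizes, the size-ratio selection rule, the RATIO and LARGE_THRESHOLD values, 
and the Job and Plan types.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SizeBasedManagerSketch {
  static final double RATIO = 2.0;          // hypothetical selection ratio
  static final long LARGE_THRESHOLD = 100;  // hypothetical: totals at or above go to E2

  static final class Job {
    final Set<String> files;
    final String executor;

    Job(Set<String> files, String executor) {
      this.files = Collections.unmodifiableSet(new TreeSet<>(files));
      this.executor = executor;
    }
  }

  static final class Plan {
    final List<Job> toCancel = new ArrayList<>();
    final List<Job> toSubmit = new ArrayList<>();
  }

  /** Picks the largest set of files whose total size is >= RATIO * its largest file. */
  static Plan plan(Map<String, Long> fileSizes, List<Job> queued) {
    Plan plan = new Plan();
    List<Map.Entry<String, Long>> sorted = new ArrayList<>(fileSizes.entrySet());
    sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());
    long sum = 0;
    for (Map.Entry<String, Long> e : sorted) {
      sum += e.getValue();
    }
    // Try all files first, then drop the largest remaining file each round.
    for (int i = 0; i < sorted.size() - 1; i++) {
      long largest = sorted.get(i).getValue();
      if (sum >= RATIO * largest) {
        Set<String> chosen = new TreeSet<>();
        for (int j = i; j < sorted.size(); j++) {
          chosen.add(sorted.get(j).getKey());
        }
        String executor = sum >= LARGE_THRESHOLD ? "E2" : "E1";
        boolean alreadyQueued = false;
        for (Job q : queued) {
          if (q.files.equals(chosen) && q.executor.equals(executor)) {
            alreadyQueued = true;      // the same job is already waiting
          } else if (!Collections.disjoint(q.files, chosen)) {
            plan.toCancel.add(q);      // overlapping queued job is superseded
          }
        }
        if (!alreadyQueued) {
          plan.toSubmit.add(new Job(chosen, executor));
        }
        return plan;
      }
      sum -= largest;
    }
    return plan; // nothing worth compacting
  }
}
```

   With assumed sizes F1=100, F2=10, and F3=10, the first call submits a job 
for F2 and F3 on E1.  Adding F4=90 and calling again with that job still 
queued cancels it and submits a job for all four files on E2.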
   
   For user-initiated compactions, compaction strategies would still be used 
for compatibility.  The behavior should be the following:
     * Cancel existing queued jobs (that are system initiated) and prevent more 
jobs from queueing.
     * Wait for any running jobs to complete.
     * Apply the user's strategy and create a job.
     * Ask the compaction manager which executor the job should be queued on.
   
   
   

