keith-turner commented on issue #4089:
URL: https://github.com/apache/accumulo/issues/4089#issuecomment-1864730008

   Looking into this I learned a few interesting things.
   
    * In Accumulo 1.x (and 2.x if using the old configs) the default compaction strategy would do [suboptimal compactions](https://github.com/apache/accumulo/blob/8dfb4eb71351b856a398db7d0f75b8ddda85ce3a/server/tserver/src/main/java/org/apache/accumulo/tserver/compaction/DefaultCompactionStrategy.java#L235-L239) if a tablet's files exceeded [this calculation](https://github.com/apache/accumulo/blob/8dfb4eb71351b856a398db7d0f75b8ddda85ce3a/core/src/main/java/org/apache/accumulo/core/conf/AccumuloConfiguration.java#L381-L389).  Suboptimal means not following the compaction ratio, which could cause the number of key/values rewritten by compactions to grow quadratically instead of logarithmically (see the sketch after this list).
    * The new DefaultCompactionPlanner in Accumulo 2.x never creates system compactions that do not follow the compaction ratio.  Unlike 1.x, it does not look at that config and force a suboptimal compaction when a tablet's files are over the limit.
    * Nothing in 2.x really uses the table.file.max property.  It seems to be used only by the deprecated compaction code.  In 1.x it was also used for merging minor compactions, which were dropped in 2.x.
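
   To make the ratio point concrete, here is a rough sketch (illustration only, with simplified names; not the actual strategy or planner code) of the check a candidate set of files must pass, and of the 1.x escape hatch that ignores it:

   ```java
   import java.util.List;

   // Illustration only: simplified, not the actual Accumulo strategy/planner code.
   public class CompactionRatioSketch {

     // A candidate set of files "follows the compaction ratio" when the largest
     // file in the set times the ratio is no bigger than the total size of the set.
     static boolean meetsCompactionRatio(List<Long> fileSizes, double ratio) {
       long total = fileSizes.stream().mapToLong(Long::longValue).sum();
       long largest = fileSizes.stream().mapToLong(Long::longValue).max().orElse(0L);
       return largest * ratio <= total;
     }

     // The 1.x escape hatch described above: once a tablet has more files than the
     // computed limit, compact anyway even though the ratio check fails. Repeatedly
     // rewriting large files this way is what makes the total key/values rewritten
     // grow quadratically instead of logarithmically.
     static boolean forceSuboptimalCompaction(int tabletFileCount, int maxFilesPerTablet) {
       return tabletFileCount > maxFilesPerTablet;
     }
   }
   ```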
   
   I was wondering if the compaction planner in 2.x should also schedule suboptimal compactions.  One reason not to is that it causes compaction threads to be overburdened with suboptimal work, which could cause files to build up on other tablets, which in turn could increase the demand for suboptimal compactions, possibly spiraling.  The 1.x code probably should have warned when it did this; if it happens repeatedly, that is an indication the compaction config needs to be adjusted.  When it happens repeatedly, resources are not being well utilized, with CPU and I/O used very inefficiently.  The flip side of inefficient compactions is that fewer files make scans more efficient.  However, if compactions are inefficient and as a result all compaction resources are consumed, then files cannot be compacted even if there is a desire to compact.
   
   This brings up the question of what to do with the table.file.max property.  The following is one possible path:
   
    1. In 2.x, modify the default compaction planner to do suboptimal compactions for compatibility with the table.file.max property and log a warning each time it does.
    2. In 2.x, add code outside the compaction planner that warns when a tablet's files exceed tserver.scan.files.open.max and no compactions were scheduled (a rough sketch follows this list).
    3. In 3.x, maybe deprecate the table.file.max property or rethink it?  One goal behind the property is to allow some tables to have fewer files for more efficient scans.  The most efficient way to achieve that goal is to lower the compaction ratio for the table rather than adjust table.file.max, so I am not sure the property serves any useful purpose.
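
   As a strawman for item 2, the check could look something like the sketch below.  The class, method, and parameter names are made up for illustration and this is not a proposal for where the code would actually live:

   ```java
   import org.slf4j.Logger;
   import org.slf4j.LoggerFactory;

   // Rough sketch of item 2 above; all names are hypothetical.
   public class TabletFileCountWarning {

     private static final Logger log = LoggerFactory.getLogger(TabletFileCountWarning.class);

     // Warn when a tablet's file count is already past the point where scans can
     // break and nothing was scheduled to bring it back down.
     static void warnIfOverScanLimit(String tablet, int tabletFileCount,
         int scanFilesOpenMax, boolean compactionScheduled) {
       if (tabletFileCount > scanFilesOpenMax && !compactionScheduled) {
         log.warn("Tablet {} has {} files, which exceeds tserver.scan.files.open.max ({}),"
             + " and no compaction was scheduled; scans may fail."
             + " The table's compaction configuration may need adjusting.",
             tablet, tabletFileCount, scanFilesOpenMax);
       }
     }
   }
   ```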
   
   Not quite sure what to do at the moment, still thinking it through.  Does anyone else have suggestions?  I am going to implement #2 above for now, because when a tablet has more than tserver.scan.files.open.max files, scans will start to break.
   