keith-turner commented on PR #4098:
URL: https://github.com/apache/accumulo/pull/4098#issuecomment-1867937514

   > I would love if instead of deprecating this we actually made it an 
imperative to compact.
   
   I have been thinking about how this could be achieved efficiently.  The way 
the old code worked for max files was so inefficient.  The compaction ratio 
efficiently reduces the number of files, but it is hard to reason about the 
number of files it will produce.
   
   One problem with the compaction ratio is that the total number of files you 
end up with in a tablet depends on the smallest expected file size. For example, 
if really small files are arriving in a tablet, say 1,000 bytes, then you will 
end up with 3x1K files, 3x3K files, 3x9K files, etc. So starting at 1K with a 
compaction ratio of 3, you end up with around 13 levels, or roughly 39 total 
files, before getting to 1G. So with the current compaction ratio, the number of 
levels and the total number of files you end up with depends on the smallest 
file arriving.
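   
   The arithmetic above can be sketched with a rough model (my own 
simplification, assuming each level accumulates up to `ratio` equal-sized files 
before compacting them into one file `ratio` times larger):
   
   ```java
   // Rough model of how many levels a fixed compaction ratio produces
   // between the smallest arriving file size and a maximum size.
   public class LevelModel {
     // Count how many times the file size must grow by `ratio`
     // to get from minSize up to maxSize.
     static int countLevels(long minSize, long maxSize, int ratio) {
       int levels = 0;
       long size = minSize;
       while (size < maxSize) {
         size *= ratio;
         levels++;
       }
       return levels;
     }

     public static void main(String[] args) {
       int ratio = 3;
       int levels = countLevels(1_000L, 1_000_000_000L, ratio);
       // up to `ratio` files can accumulate per level before compacting
       System.out.println(levels + " levels, up to " + (levels * ratio) + " files");
       // prints: 13 levels, up to 39 files
     }
   }
   ```
   
   Under this model the level count grows with the log of the size range, so the 
smaller the smallest arriving file, the more levels and total files a fixed 
ratio yields.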
   
   For larger files we want a higher compaction ratio, but maybe we don't want 
that for smaller files. This led me to wonder about making the compaction ratio 
a function of the file size, something like the following, which more 
aggressively reduces the number of small files while still doing logarithmic 
work.
   
   ```java
      // function that, given a file size, computes a compaction ratio
      Function<Long, Double> ratioFunction = size -> {
        if (size < 1_000) {
          return 1.0;
        } else if (size < 10_000) {
          return 1.25;
        } else if (size < 100_000) {
          return 1.5;
        } else if (size < 1_000_000) {
          return 1.75;
        } else if (size < 10_000_000) {
          return 2.0;
        } else if (size < 100_000_000) {
          return 2.25;
        } else {
          return 3.0;
        }
      };
   ```
   
   I am trying to figure out if we could generate the above function given a 
target max number of files and an expected max size that accommodates any 
minimum file size. So a user would not specify the above function; one would be 
generated based on the desired max files and expected max file size. If this 
could be done, it could be implemented as another compaction planner for 2.x.
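   
   One hypothetical way such a function could be generated (a sketch under my 
own assumptions, not a concrete proposal; `generate`, `minRatio`, and `maxRatio` 
are invented names, and how a desired max file count maps onto the ratio bounds 
is still the open question):
   
   ```java
   import java.util.function.Function;

   // Hypothetical generator for a size-dependent compaction ratio.  A sketch
   // only: it interpolates the ratio on a log-size scale between an aggressive
   // ratio for the smallest files and a relaxed ratio at the expected max size,
   // rather than deriving the curve from a target max file count.
   public class RatioFunctionGenerator {
     static Function<Long, Double> generate(long expectedMaxSize,
         double minRatio, double maxRatio) {
       double maxLog = Math.log((double) expectedMaxSize);
       return size -> {
         if (size <= 1) {
           return minRatio;
         }
         // fraction of the way, log-wise, from size 1 to the expected max
         double frac = Math.min(1.0, Math.log((double) size) / maxLog);
         return minRatio + frac * (maxRatio - minRatio);
       };
     }

     public static void main(String[] args) {
       Function<Long, Double> ratio = generate(1_000_000_000L, 1.0, 3.0);
       System.out.println(ratio.apply(1_000L));         // small files: ratio close to 1.0
       System.out.println(ratio.apply(1_000_000_000L)); // 3.0 at the expected max size
     }
   }
   ```
   
   Plugging in the bounds from the hand-written function above (1.0 at the small 
end, 3.0 near 1G) gives a smooth version of the same shape.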
   

