keith-turner commented on PR #4098:
URL: https://github.com/apache/accumulo/pull/4098#issuecomment-1867937514
> I would love if instead of deprecating this we actually made it an
> imperative to compact.
I have been thinking about how this could be achieved efficiently. The way
the old code worked for max files was so inefficient. The compaction ratio
efficiently reduces the number of files, but it is hard to reason about the
number of files it will produce.
One problem with the compaction ratio is that the total number of files you
end up with in a tablet depends on the smallest expected file size. For
example, if really small files are arriving in a tablet, say 1000 bytes each,
you will end up with 3x1K files, 3x3K files, 3x9K files, etc. So starting at 1K
with a compaction ratio of 3, you end up with roughly 13 levels, or about 39
total files, before getting to 1G. So with the current compaction ratio, the
number of levels and total number of files you end up with depends on the
smallest file arriving.
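A rough sketch of this arithmetic, under simplifying assumptions: every level
is exactly `ratio` times larger than the previous one, and a level holds up to
`ceil(ratio)` files. The class and method names are made up for illustration
and are not Accumulo code.

```java
class CompactionLevels {
    // Counts the size levels between the smallest and largest file, assuming
    // each level is exactly `ratio` times larger than the previous one.
    static int levels(double smallestBytes, double largestBytes, double ratio) {
        if (ratio <= 1.0 || smallestBytes <= 0) {
            throw new IllegalArgumentException("need ratio > 1 and a positive smallest size");
        }
        int count = 0;
        for (double size = smallestBytes; size < largestBytes; size *= ratio) {
            count++;
        }
        return count;
    }

    // Upper bound on files, assuming every level holds up to ceil(ratio) files.
    static int maxFiles(double smallestBytes, double largestBytes, double ratio) {
        return levels(smallestBytes, largestBytes, ratio) * (int) Math.ceil(ratio);
    }

    public static void main(String[] args) {
        double oneGig = 1_000_000_000d;
        // Compare a 1K smallest file with a 1M smallest file at ratio 3.
        for (double smallest : new double[] {1_000d, 1_000_000d}) {
            System.out.printf("smallest=%.0f bytes: %d levels, up to %d files%n",
                smallest, levels(smallest, oneGig, 3.0), maxFiles(smallest, oneGig, 3.0));
        }
    }
}
```

Running it with different smallest sizes shows how the file count for the same
ratio shrinks as the smallest arriving file gets larger.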
For larger files we want a higher compaction ratio, but maybe we don't want
that for smaller files. This led me to wonder about making the compaction
ratio a function of the file size. Something like the following that more
aggressively reduces the number of small files, while still doing logarithmic
work.
```java
import java.util.function.Function;

// function that, given a file size in bytes, computes a compaction ratio
Function<Long,Double> ratioFunction = size -> {
  if (size < 1_000) {
    return 1.0;
  } else if (size < 10_000) {
    return 1.25;
  } else if (size < 100_000) {
    return 1.5;
  } else if (size < 1_000_000) {
    return 1.75;
  } else if (size < 10_000_000) {
    return 2.0;
  } else if (size < 100_000_000) {
    return 2.25;
  } else {
    return 3.0;
  }
};
```
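One way such a function might plug into file selection is sketched below. It
assumes the usual ratio test (compact a set when the sum of its sizes is at
least the ratio times the largest file in the set) and looks up the ratio from
the largest file of each candidate set. That lookup rule, the class name
`SizeAwareRatioSketch`, and the abbreviated two-step ratio function in `main`
are all assumptions for illustration, not the actual planner code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

class SizeAwareRatioSketch {
    // Returns the file sizes selected for compaction, or an empty list.
    // Starts with all files and drops the largest one until the remaining
    // set satisfies: sum(sizes) >= ratio(largest) * largest.
    static List<Long> select(List<Long> fileSizes, Function<Long, Double> ratioFunction) {
        List<Long> sorted = new ArrayList<>(fileSizes);
        Collections.sort(sorted);
        long sum = 0;
        for (long size : sorted) {
            sum += size;
        }
        // Require at least two files in a compaction.
        for (int end = sorted.size(); end >= 2; end--) {
            long largest = sorted.get(end - 1);
            if (sum >= ratioFunction.apply(largest) * largest) {
                return new ArrayList<>(sorted.subList(0, end));
            }
            sum -= largest;
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Abbreviated stand-in for a size-dependent ratio function.
        Function<Long, Double> sizeAware = size -> size < 100_000 ? 1.5 : 3.0;
        Function<Long, Double> constant = size -> 3.0;

        List<Long> files = Arrays.asList(1_000_000_000L, 50_000L, 40_000L);
        System.out.println("size-aware ratio selects: " + select(files, sizeAware));
        System.out.println("constant ratio selects:   " + select(files, constant));
    }
}
```

In this sketch the two small files qualify for compaction under the
size-dependent ratio but not under a constant ratio of 3, which is the more
aggressive handling of small files described above.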
I am trying to figure out whether we could generate the above function from a
target maximum number of files and an expected maximum file size, in a way that
accommodates any minimum file size. A user would not specify the above
function; one would be generated from the desired max files and expected max
file size. If this could be done, it could be implemented as another compaction
planner for 2.x.