Hi,

Playing with TarMK compaction lately, I realized that the process may create additional files even when, overall, there is not enough garbage in the repository to justify running compaction.

The way it works now: you manually trigger the compaction process, which starts copying content (via a diff) to new files so that the old tar files can be GC'ed. Once that is done, the cleanup process starts. Cleanup looks at each tar file, and if a file has more than 25% garbage it is cleaned up (a new generation of the file is created containing only the relevant content, no garbage).

The disconnect between compaction and cleanup can cause even a clean repo to grow. Each new file has a fixed size of 256 MB, so if compaction adds 256 MB but cleanup doesn't find anything worth reclaiming, your repo goes up by 256 MB for no real reason. Over time this will stabilize, but the first-time increase can be a bit unexpected, and the bigger the repository, the bigger the increase.

I'm proposing a solution to alleviate this problem. I'd like to first check whether there is enough garbage in the repo to justify running compaction: check each tar file, and only if at least one of them needs cleanup (>25% garbage) allow compaction and cleanup to go through. This should stabilize the size of a repo that hasn't changed much since the last compaction run.

I've created OAK-2019 to track this.

Opinions are highly welcome!

alex
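
P.S. To make the proposed check a bit more concrete, here is a rough sketch of the gating logic. It is only an illustration under stated assumptions, not a patch: CompactionGate, TarFileStats and shouldRunCompaction are made-up names (not Oak APIs), the tar file names are illustrative, and the per-file garbage sizes are assumed to be available from whatever estimate the cleanup already computes. The only thing taken from the description above is the rule itself: run compaction and cleanup only if at least one tar file has more than 25% garbage.

```java
import java.util.List;

public class CompactionGate {

    /** Cleanup threshold from the description above: more than 25% garbage. */
    static final double GARBAGE_THRESHOLD = 0.25;

    /** Hypothetical per-file summary; this is NOT an Oak class. */
    static final class TarFileStats {
        final String name;
        final long totalBytes;
        final long garbageBytes;

        TarFileStats(String name, long totalBytes, long garbageBytes) {
            this.name = name;
            this.totalBytes = totalBytes;
            this.garbageBytes = garbageBytes;
        }

        /** Fraction of the file that is garbage; an empty file counts as clean. */
        double garbageRatio() {
            return totalBytes == 0 ? 0.0 : (double) garbageBytes / totalBytes;
        }
    }

    /** True if this single file would be rewritten by cleanup. */
    static boolean needsCleanup(TarFileStats file) {
        return file.garbageRatio() > GARBAGE_THRESHOLD;
    }

    /** Proposed gate: allow compaction only if at least one file needs cleanup. */
    static boolean shouldRunCompaction(List<TarFileStats> files) {
        for (TarFileStats file : files) {
            if (needsCleanup(file)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;

        // Mostly clean repo: no file is above the threshold, so skip compaction.
        List<TarFileStats> clean = List.of(
            new TarFileStats("data00000a.tar", 256 * mb, 10 * mb),
            new TarFileStats("data00001a.tar", 256 * mb, 40 * mb));

        // One file is well above 25% garbage, so compaction is worth running.
        List<TarFileStats> dirty = List.of(
            new TarFileStats("data00000a.tar", 256 * mb, 10 * mb),
            new TarFileStats("data00001a.tar", 256 * mb, 100 * mb));

        System.out.println("clean repo -> run compaction? " + shouldRunCompaction(clean));
        System.out.println("dirty repo -> run compaction? " + shouldRunCompaction(dirty));
    }
}
```

In this sketch the comparison is strict, so a file at exactly 25% garbage does not trigger a run, which matches the ">25%" wording above. Whether the real check should use the same per-file estimate as cleanup or a cheaper approximation is an open question.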
