Hi,

We recently added a new "compaction" feature to the TarMK (see OAK-1804). This feature traverses the content tree and copies all non-bulk* content to new data segments. We do this for two main reasons:

a) The cleanup operation is unable to collect data segments that contain a mix of reachable and unreachable content, or indeed any segments referenced by such mixed segments, regardless of whether those referenced segments have any reachable content. By copying all reachable content to new data segments, compaction makes the previously mixed data segments and the segments they reference collectable by the cleanup operation.

b) Commits over time can end up splintering related parts of the content tree over many small segments, which reduces locality of reference and makes caching less efficient. Compaction reverses this process and ensures that related content (same subtree) gets packed together in the compacted segments.

The compaction feature works both offline with the oak-run tool and online with the FileStore.gc() method, and apart from a few smaller issues like OAK-1917 and OAK-1927 it generally works fine. However, based on some real-world usage I've identified one bigger issue that still needs solving:

The compaction code already takes into account the chance of concurrent commits during compaction. Such commits get automatically rebased to the compacted state to prevent them from referencing data segments from before the compaction. Unfortunately, since the compactor shares the SegmentWriter with normal repository updates, the pre-rebased commits also typically end up sharing the same segments with compacted content. This turns such segments into troublesome mixed segments that still contain references to data segments from before compaction, and that thus prevent those older segments from being cleaned up.

That problem should be solvable by using a separate SegmentWriter instance for the compaction. I'm looking at this now.

*) Bulk content consists of binaries larger than 16kB. They get stored in bulk segments (or in a data store, if so configured), and are just referenced from the tree structure stored in data segments.

BR,

Jukka Zitting
