[ https://issues.apache.org/jira/browse/OAK-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829210#comment-17829210 ]
Stefan Egli commented on OAK-10688:
-----------------------------------

* previous branch was discarded
* new branch created instead: https://github.com/apache/jackrabbit-oak/tree/OAK-10688-rebase
* copied the change from the original branch, with further changes, into the new branch (in [this commit|https://github.com/apache/jackrabbit-oak/commit/40225c85a6c0784ea120ffe1b4aaa486c50fecc8])
* created a PR with that [here|https://github.com/apache/jackrabbit-oak/pull/1372]

> Keep only traversed state, remove all other revisions
> -----------------------------------------------------
>
>                 Key: OAK-10688
>                 URL: https://issues.apache.org/jira/browse/OAK-10688
>             Project: Jackrabbit Oak
>          Issue Type: Task
>          Components: documentmk
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>            Priority: Major
>
> As a slightly different algorithm to OAK-10535, this ticket suggests
> calculating the traversedState of a node, then keeping only those revisions
> needed for that traversedState and removing all others. The main difference
> is an inversion of logic: instead of analysing for each revision whether it
> must be kept or not, this first derives the revisions that must be "kept"
> from the traversedState, then deletes all others.
> This mechanism applies to all (normal and bundled) properties as well as some
> DocumentNodeStore internal ones, such as "_deleted".
> Below is a list of assumptions to back this:
> * DetailedGC runs only up to the older of the oldest checkpoint and
> maxRevisionAge (24h by default). Thus a document analysed by DetailedGC is
> guaranteed to have only 1 revision (per property) that must be kept, as it
> is guaranteed to not have modifications (revisions) younger than any
> checkpoint or maxRevisionAge (24h)
> * To find out which revision(s) must be kept, the node tree is traversed from
> root (based on current head revision) to the target document.
> * Given the first bullet (that we're only looking at nodes that have only 1
> revision each, per property, to keep), this traversed node state thus
> contains the values of those revisions.
> * Hence, for each property key of the traversed state, the corresponding
> "commit revision" in the document-local map must be calculated. That local
> map entry must be kept - all others can be deleted.
> * Note that this also cleans up overwritten branch commits of the same branch
> (as only the last, relevant one is kept)
> As a result of the above, certain other entries can be deleted, namely:
> * any "_commitRoot" entry no longer referenced by the local document
> * any "_bc" entry no longer referenced by the local document
> Independent of the traversedState and the outcome of the cleanup, what can
> also be removed is:
> * any "_revisions" entry older than the current sweepRev
> However: "_revisions" entries that are not referenced by the local document
> but are younger than the sweepRev must still be kept, as they might be
> referenced by child documents (through their "_commitRoot" pointing to the
> current document). Without checking for children and double-checking the
> actual use, there could as a result still be some garbage "_revisions"
> entries left.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
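
For illustration only, the cleanup order described in the ticket can be sketched on a toy model. This is not the Oak implementation: `ToyDocument`, `cleanup`, the use of plain `long` timestamps as revisions, and matching the kept revision by comparing values are all assumptions made for this sketch. The real DocumentNodeStore resolves commit status and branch commits, which is skipped here.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;

class KeepTraversedStateSketch {

    /** Hypothetical toy document; every map is keyed by a revision timestamp. */
    static class ToyDocument {
        // property name -> (revision -> value), i.e. the document-local map
        final Map<String, TreeMap<Long, String>> properties = new TreeMap<>();
        final TreeMap<Long, String> commitRoot = new TreeMap<>(); // "_commitRoot"
        final TreeMap<Long, String> bc = new TreeMap<>();         // "_bc"
        final TreeMap<Long, String> revisions = new TreeMap<>();  // "_revisions"
    }

    /**
     * Keeps, per traversed property, the one revision that carries the
     * traversed value and deletes the others. Returns the kept revisions.
     */
    static Set<Long> cleanup(ToyDocument doc,
                             Map<String, String> traversedState,
                             long sweepRev) {
        Set<Long> kept = new HashSet<>();
        for (Map.Entry<String, String> t : traversedState.entrySet()) {
            TreeMap<Long, String> local = doc.properties.get(t.getKey());
            if (local == null) {
                continue; // nothing stored locally for this property
            }
            // Assumption: the newest local revision with the traversed
            // value stands in for the "commit revision" to keep.
            Long keep = null;
            for (Map.Entry<Long, String> r : local.descendingMap().entrySet()) {
                if (Objects.equals(r.getValue(), t.getValue())) {
                    keep = r.getKey();
                    break;
                }
            }
            if (keep == null) {
                continue; // be conservative: no match, delete nothing
            }
            final Long keepRev = keep;
            local.keySet().removeIf(rev -> !rev.equals(keepRev));
            kept.add(keepRev);
        }
        // Entries no longer referenced by the local document can go.
        doc.commitRoot.keySet().removeIf(rev -> !kept.contains(rev));
        doc.bc.keySet().removeIf(rev -> !kept.contains(rev));
        // Independent of the above: drop "_revisions" older than sweepRev;
        // younger ones stay, since child documents may still point at them.
        doc.revisions.headMap(sweepRev, false).clear();
        return kept;
    }
}
```

In this sketch, properties that are absent from the traversed state are left untouched, and "_revisions" entries at or after `sweepRev` survive even when nothing in the local document refers to them, mirroring the caveat about child documents in the description.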