Stefan Egli created OAK-10688:
---------------------------------

             Summary: Keep only traversed state, remove all other revisions
                 Key: OAK-10688
                 URL: https://issues.apache.org/jira/browse/OAK-10688
             Project: Jackrabbit Oak
          Issue Type: Task
          Components: documentmk
            Reporter: Stefan Egli
            Assignee: Stefan Egli


As a slightly different algorithm to OAK-10535, this ticket proposes 
calculating the traversedState of a node, then keeping only those revisions 
needed for that traversedState and removing all others. The main difference is 
an inversion of logic: instead of analysing for each revision whether it must 
be kept or not, this approach first derives the revisions that must be kept 
from the traversedState, then deletes all others.

This mechanism applies to all (normal and bundled) properties as well as some 
DocumentNodeStore internal ones, such as "_deleted".

Below is a list of assumptions to back this:
* DetailedGC runs only up to the older of the oldest checkpoint and 
maxRevisionAge (24h by default). Thus a document analysed by DetailedGC is 
guaranteed to have only 1 revision (per property) that must be kept, as it is 
guaranteed to not have modifications (revisions) younger than any checkpoint or 
maxRevisionAge (24h).
* To find out which revision(s) must be kept, the node tree is traversed from 
root (based on current head revision) to the target document.
* Given the first bullet (we are only looking at nodes that have only 1 
revision to keep per property), this traversed node state contains the values 
of those revisions.
* Hence, for each property key of the traversed state, the corresponding 
"commit revision" in the document-local map must be calculated. That local map 
entry must be kept - all others can be deleted.
* Note that this also cleans up overwritten branch commits of the same branch 
(as only the last, relevant one is kept)
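
The inversion described in the list above can be sketched roughly as follows. 
This is a minimal illustration, not DocumentNodeStore code: the class and 
method names are hypothetical, revisions are modelled as plain strings, and 
the revision backing each traversed property is assumed to have been resolved 
already by the caller. The sketch further assumes that a property absent from 
the traversed state keeps no revision at all.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: keep one revision per traversed property, remove the rest. */
public class TraversedStateCleanupSketch {

    /**
     * @param localMaps     property name -> (revision -> value), standing in for
     *                      the document-local maps
     * @param keptRevisions property name -> revision backing the traversed state
     *                      (assumed to be resolved by the caller)
     * @return property name -> revisions that can be removed
     */
    public static Map<String, Set<String>> revisionsToRemove(
            Map<String, Map<String, String>> localMaps,
            Map<String, String> keptRevisions) {
        Map<String, Set<String>> removals = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> entry : localMaps.entrySet()) {
            String property = entry.getKey();
            // Start from all revisions, then subtract the single one to keep.
            Set<String> toRemove = new TreeSet<>(entry.getValue().keySet());
            String kept = keptRevisions.get(property);
            if (kept != null) {
                toRemove.remove(kept);
            }
            if (!toRemove.isEmpty()) {
                removals.put(property, toRemove);
            }
        }
        return removals;
    }
}
```

Note that the method never inspects a revision to decide whether it is 
garbage; it only needs to know which single revision to keep.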

As a result of the above, certain other entries can be deleted, namely:
* any "_commitRoot" entry no longer referenced by the local document
* any "_bc" entry no longer referenced by the local document
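
The same check applies to both "_commitRoot" and "_bc", so it can be sketched 
as one set difference. The names below are hypothetical and revisions are 
again plain strings; the set of kept revisions is assumed to be the outcome of 
the property cleanup.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: find "_commitRoot" / "_bc" keys nothing local refers to. */
public class UnreferencedEntriesSketch {

    /**
     * @param entryRevisions revision keys currently present in "_commitRoot" or "_bc"
     * @param keptRevisions  revisions still referenced by the local document after cleanup
     * @return revision keys whose entries can be removed
     */
    public static Set<String> unreferenced(Set<String> entryRevisions,
                                           Set<String> keptRevisions) {
        Set<String> result = new TreeSet<>(entryRevisions);
        result.removeAll(keptRevisions);
        return result;
    }
}
```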

Independent of the traversedState and the outcome of the cleanup, the 
following can also be removed:
* any "_revisions" entry older than the current sweepRev

However, "_revisions" entries that are not referenced by the local document 
but are younger than the sweepRev must still be kept, as they might be 
referenced by child documents (through their "_commitRoot" pointing to the 
current document). Without checking for children and double-checking the actual 
use, some garbage "_revisions" entries could therefore still be left behind.
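
The "_revisions" rule can be sketched as a plain age filter. This is a 
simplification under stated assumptions: each entry is reduced to a timestamp 
and the sweepRev to a single long, which ignores the other parts of a real 
revision; the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: "_revisions" entries older than the sweep revision may go. */
public class RevisionsEntryCleanupSketch {

    /**
     * @param revisionTimestamps "_revisions" key -> timestamp (simplified model)
     * @param sweepTimestamp     timestamp standing in for the current sweepRev
     * @return keys of entries strictly older than the sweep timestamp
     */
    public static Set<String> removable(Map<String, Long> revisionTimestamps,
                                        long sweepTimestamp) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Long> entry : revisionTimestamps.entrySet()) {
            // Entries at or after the sweep timestamp are kept even if the local
            // document no longer references them: a child may still point here.
            if (entry.getValue() < sweepTimestamp) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

The filter deliberately takes no set of locally referenced revisions as 
input, which is why some unreferenced younger entries survive.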



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
