[ https://issues.apache.org/jira/browse/OAK-4826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15529188#comment-15529188 ]
Marcel Reutegger commented on OAK-4826:
---------------------------------------

The problem is that the failing test interleaves two async index update calls
in a specific way and expects the result to be successful. The first index
update is triggered and creates a checkpoint C1, but before it proceeds with
the actual index update and persists C1 as the reference checkpoint, another
index update is triggered that completes with a checkpoint C2. It then
considers C1 as orphaned because the created timestamp of C1 is older than
that of C2.

I'm not sure if this is really a valid test case. Shouldn't C1 be protected by
the lease mechanism?

I will change the cleanup implementation in any case to protect against races
when a checkpoint is created and later persisted as a reference checkpoint
within the lease time frame.

> Auto removal of orphaned checkpoints
> ------------------------------------
>
>                 Key: OAK-4826
>                 URL: https://issues.apache.org/jira/browse/OAK-4826
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: core
>            Reporter: Chetan Mehrotra
>            Assignee: Marcel Reutegger
>              Labels: candidate_oak_1_4
>             Fix For: 1.6, 1.5.11
>
>         Attachments: OAK-4826.patch, OAK-4826.patch, OAK-4826.patch
>
>
> Currently, if a running system has orphaned checkpoints, they prevent the
> revision gc (compaction for segment) from being effective.
> So far the practice has been to use the {{oak-run checkpoints rm-unreferenced}}
> command to clean them up manually. This was left as a manual step because it
> was not possible to determine whether a given checkpoint is in use or not.
> rm-unreferenced works on the basis that checkpoints are only made from
> AsyncIndexUpdate and hence can check if a checkpoint is in use by cross
> checking with the {{:async}} state. Doing it in auto mode is risky as the
> {{checkpoint}} api can be used by any module.
> With OAK-2314 we also record some metadata like {{creator}} and {{name}}.
> This can be used for auto cleanup.
> For example, in a running system the following checkpoints are listed:
> {noformat}
> Mon Sep 19 18:02:09 EDT 2016    Sun Jun 16 18:02:09 EDT 2019    r15744787d0a-1-1
>
>     creator=AsyncIndexUpdate
>     name=fulltext-async
>     thread=sling-default-4070-Registered Service.653
>
> Mon Sep 19 18:02:09 EDT 2016    Sun Jun 16 18:02:09 EDT 2019    r15744787d0a-0-1
>
>     creator=AsyncIndexUpdate
>     name=async
>     thread=sling-default-4072-Registered Service.656
>
> Fri Aug 19 18:57:33 EDT 2016    Thu May 16 18:57:33 EDT 2019    r156a50612e1-1-1
>
>     creator=AsyncIndexUpdate
>     name=async
>     thread=sling-default-10-Registered Service.654
>
> Wed Aug 10 12:13:20 EDT 2016    Tue May 07 12:25:52 EDT 2019    r156753ac38d-0-1
>
>     creator=AsyncIndexUpdate
>     name=async
>     thread=sling-default-6041-Registered Service.1966
> {noformat}
> As can be seen, the last 2 checkpoints are orphaned and would prevent
> revision gc. For auto mode we can use the following heuristic:
> # List all current checkpoints
> # Only keep the latest checkpoint for a given {{creator}} and {{name}} combo.
> Other entries for the same pair that are older (by creation time) can be
> considered orphaned and deleted
> This logic can be implemented in
> {{org.apache.jackrabbit.oak.checkpoint.Checkpoints}} and can be invoked by the
> Revision GC logic (both in DocumentNodeStore and SegmentNodeStore) to
> determine the base revision to keep

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
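
For illustration only, the heuristic from the description, combined with the
lease-time guard mentioned in the comment, might be sketched as follows. This
is not the attached patch and does not use Oak's API: the class
OrphanedCheckpoints, the Checkpoint value type, findOrphaned and the
leaseMillis parameter are all hypothetical names, and treating "younger than
the lease time" as "possibly still in use" is an assumption about how the
race could be avoided.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not part of Oak and not the OAK-4826 patch. */
final class OrphanedCheckpoints {

    /** Hypothetical value type holding the metadata shown in the listing. */
    static final class Checkpoint {
        final String id;
        final String creator;
        final String name;
        final long created; // creation time in ms since epoch

        Checkpoint(String id, String creator, String name, long created) {
            this.id = id;
            this.creator = creator;
            this.name = name;
            this.created = created;
        }
    }

    /**
     * Returns the ids of checkpoints that are older than the latest
     * checkpoint with the same creator/name pair AND older than the
     * lease time (assumed guard against the C1/C2 race).
     */
    static List<String> findOrphaned(List<Checkpoint> all, long now, long leaseMillis) {
        // Step 1: find the latest creation time per creator/name pair.
        Map<String, Long> latest = new HashMap<>();
        for (Checkpoint c : all) {
            latest.merge(key(c), c.created, Long::max);
        }
        // Step 2: anything older than the latest of its pair is a candidate,
        // unless it is still within the lease time frame.
        List<String> orphaned = new ArrayList<>();
        for (Checkpoint c : all) {
            boolean superseded = c.created < latest.get(key(c));
            boolean leaseExpired = now - c.created > leaseMillis;
            if (superseded && leaseExpired) {
                orphaned.add(c.id);
            }
        }
        return orphaned;
    }

    private static String key(Checkpoint c) {
        // NUL separator so that creator and name cannot run into each other.
        return c.creator + '\u0000' + c.name;
    }
}
```

With the four checkpoints from the listing, such a method would report the two
older name=async entries once they are past the lease time. In the interleaved
case from the comment, C1 is older than C2 but still within the lease time, so
it would be kept.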