[ https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15875870#comment-15875870 ]
Julian Reschke edited comment on OAK-4780 at 2/21/17 12:46 PM:
---------------------------------------------------------------
Here's an approach that might be simpler but in the end achieves the same goal:
- set a limit for the collection phase, both for elapsed time and number of documents
- when the limit is reached, sort the collected IDs by modified date and compute a
new upper limit so that half of the documents fall out of range; throw these
entries away
- continue the collection with the smaller time window (this just needs an
internal API that allows specifying the {{_id}} to start with and assumes that the
query returns documents sorted by {{_id}})
- compute a new limit for elapsed time (half of the original?)
Eventually, we should have a set of documents that we *can* garbage collect.
Finally, if the maintenance window is still open, rerun the GC.
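A minimal sketch of the collection loop described above, not the actual VersionGarbageCollector code. The names {{IncrementalCollectSketch}}, {{Doc}}, {{Result}} and {{collect}} are hypothetical; an in-memory list sorted by {{_id}} stands in for the store query, and the elapsed-time limit is left out so the example stays deterministic. Documents that share the cut-off timestamp are dropped too, so a shrink may discard slightly more than half.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class IncrementalCollectSketch {

    /** Hypothetical stand-in for a stored document: its _id and modified time. */
    static final class Doc {
        final String id;
        final long modified;
        Doc(String id, long modified) { this.id = id; this.modified = modified; }
    }

    /** Outcome of one collection pass: the shrunken upper limit and the IDs below it. */
    static final class Result {
        final long upperLimit;
        final List<String> ids;
        Result(long upperLimit, List<String> ids) { this.upperLimit = upperLimit; this.ids = ids; }
    }

    /**
     * Walks the documents in _id order and keeps those with modified < upperLimit.
     * Whenever maxDocs candidates are held, the window is shrunk so that the newer
     * half drops out, and the walk simply continues from the current _id.
     */
    static Result collect(List<Doc> docsSortedById, long initialUpperLimit, int maxDocs) {
        long upperLimit = initialUpperLimit;
        List<Doc> collected = new ArrayList<>();
        for (Doc d : docsSortedById) {
            if (d.modified >= upperLimit) {
                continue; // outside the current time window
            }
            collected.add(d);
            if (collected.size() >= maxDocs) {
                // Limit reached: sort by modified date and take the median as new upper limit.
                collected.sort(Comparator.comparingLong((Doc c) -> c.modified));
                final long newLimit = collected.get(collected.size() / 2).modified;
                // Throw away everything that is now out of range.
                collected.removeIf(c -> c.modified >= newLimit);
                upperLimit = newLimit;
            }
        }
        List<String> ids = new ArrayList<>();
        for (Doc d : collected) {
            ids.add(d.id);
        }
        return new Result(upperLimit, ids);
    }
}
```

Because documents already passed were all checked against a larger window, the surviving set is complete for the final, smaller window; a later run can then pick up the newer documents.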
was (Author: reschke):
Here's an approach that might be simpler but in the end achieves the same goal:
- set a limit for the collection phase, both for elapsed time and number of documents
- when the limit is reached, sort the collected IDs by modified date and compute a
new upper limit so that half of the documents fall out of range; throw these
entries away
- continue the collection with the smaller time window (this just needs an
internal API that allows specifying the {{_id}} to start with)
- compute a new limit for elapsed time (half of the original?)
Eventually, we should have a set of documents that we *can* garbage collect.
Finally, if the maintenance window is still open, rerun the GC.
> VersionGarbageCollector should be able to run incrementally
> -----------------------------------------------------------
>
> Key: OAK-4780
> URL: https://issues.apache.org/jira/browse/OAK-4780
> Project: Jackrabbit Oak
> Issue Type: Task
> Components: core, documentmk
> Reporter: Julian Reschke
> Attachments: leafnodes.diff, leafnodes-v2.diff, leafnodes-v3.diff
>
>
> Right now, the documentmk's version garbage collection runs in several phases.
> It first collects the paths of candidate nodes, and only once this has been
> successfully finished, starts actually deleting nodes.
> This can be a problem when the regularly scheduled garbage collection is
> interrupted during the path collection phase, maybe due to other maintenance
> tasks. On the next run, the number of paths to be collected will be even
> bigger, thus making it even more likely to fail.
> We should think about a change in the logic that would allow the GC to run in
> chunks, maybe by partitioning the path space by top-level directory.