[
https://issues.apache.org/jira/browse/OAK-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167023#comment-15167023
]
Alex Parvulescu commented on OAK-3362:
--------------------------------------
to provide an update on this issue, I ran into some problems implementing the
GC estimation delta:
* what happens when there are multiple updates to the same node between the
two revisions? In the case of indexing (the Lucene index, for example) the
same binaries are updated over and over, creating garbage that will not be
seen by simply diffing the two ends of the revision interval; we need to go
fine-grained to really count all the garbage.
* checkpoints! Creating at least one every 5 seconds makes for an incredibly
expensive diff. I'm taking an approach where I run the content root diff
separately and then investigate the checkpoint situation: basically ignore the
added checkpoints and evaluate only the deleted ones. That's how I ran into
problem 1, listed above.
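To make problem 1 concrete, here is a toy model (the `Revision` record and both method names are hypothetical illustrations, not Oak API): a node whose binary is rewritten on every revision. An endpoint diff sees at most one superseded value, while walking every intermediate revision counts all of them.

```java
import java.util.ArrayList;
import java.util.List;

public class EndpointDiffExample {

    // One revision of the node: the identity and size of its binary at
    // that point in time (hypothetical stand-in, not Oak's NodeState).
    record Revision(String binaryId, long size) {}

    // Diffing only the two ends of the interval: at most one superseded
    // binary (the value at the start of the interval) is visible.
    static long endpointDiffGarbage(List<Revision> revs) {
        if (revs.size() < 2) return 0;
        Revision first = revs.get(0);
        Revision last = revs.get(revs.size() - 1);
        return first.binaryId().equals(last.binaryId()) ? 0 : first.size();
    }

    // Fine-grained walk over every revision: each value superseded by the
    // next one counts as garbage.
    static long fineGrainedGarbage(List<Revision> revs) {
        long garbage = 0;
        for (int i = 0; i + 1 < revs.size(); i++) {
            garbage += revs.get(i).size();
        }
        return garbage;
    }

    public static void main(String[] args) {
        // 1 MB binary rewritten across 10 revisions, as a Lucene index
        // update pattern might do.
        List<Revision> revs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            revs.add(new Revision("b" + i, 1_000_000L));
        }
        System.out.println("endpoint diff sees:   " + endpointDiffGarbage(revs)); // 1000000
        System.out.println("fine-grained walk sees: " + fineGrainedGarbage(revs)); // 9000000
    }
}
```

In this model the endpoint diff reports 1 MB of garbage while the fine-grained walk finds 9 MB, which is why diffing only the two ends of the interval undercounts badly for churn-heavy subtrees.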
It looks more and more like this should be an incremental revision
walkthrough: start at the reference revision and incrementally work up to the
current head, looking at deleted content and counting up the garbage. This
would be a lot simpler to implement (only content diffs, looking at
deleted/changed nodes), but at this point I'm wondering how expensive it will
be compared to the current situation.
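A minimal sketch of that incremental loop, assuming a hypothetical `RevisionDiff` step that reports the bytes deleted between two consecutive revisions (not Oak's actual API). It also folds in the early-stop idea from the issue description: once the accumulated garbage crosses the GC threshold, there is no point walking further.

```java
import java.util.Iterator;
import java.util.List;

public class IncrementalEstimator {

    // Bytes of garbage created by one revision step (hypothetical stand-in
    // for a diff between two consecutive revisions).
    interface RevisionDiff {
        long deletedBytes();
    }

    // Walk from the reference revision up to the current head, summing
    // garbage from each consecutive diff. Stops early once the threshold
    // is crossed, since compaction is warranted at that point anyway.
    static long estimateGarbage(Iterator<RevisionDiff> steps, long gcThreshold) {
        long garbage = 0;
        while (steps.hasNext() && garbage < gcThreshold) {
            garbage += steps.next().deletedBytes();
        }
        return garbage;
    }

    public static void main(String[] args) {
        // Four steps of 400 bytes each; with a threshold of 1000 the walk
        // stops after the third step and never diffs the fourth.
        List<RevisionDiff> steps =
                List.of(() -> 400L, () -> 400L, () -> 400L, () -> 400L);
        System.out.println(estimateGarbage(steps.iterator(), 1000L)); // 1200
    }
}
```

The cost of this walk is one content diff per revision step rather than one huge diff over the whole interval, which is the trade-off the comment above is weighing.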
> Estimate compaction based on diff to previous compacted head state
> ------------------------------------------------------------------
>
> Key: OAK-3362
> URL: https://issues.apache.org/jira/browse/OAK-3362
> Project: Jackrabbit Oak
> Issue Type: New Feature
> Components: segmentmk
> Reporter: Alex Parvulescu
> Assignee: Alex Parvulescu
> Priority: Minor
> Labels: compaction, gc
> Fix For: 1.6
>
>
> Food for thought: try to base the compaction estimation on a diff between the
> latest compacted state and the current state.
> Pros
> * estimation duration would be proportional to the number of changes on the
> current head state
> * using the size on disk as a reference, we could actually stop the
> estimation early when we go over the GC threshold
> * data collected during this diff could in theory be passed as input to the
> compactor so it could focus on compacting a specific subtree
> Cons
> * need to keep a reference to a previous compacted state; post-startup and
> pre-compaction this might prove difficult (except maybe if we only persist
> the revision, similar to what the async indexer currently does)
> * coming up with a threshold for running compaction might prove difficult
> * diff might be costly, but still cheaper than the current full diff
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)