[ https://issues.apache.org/jira/browse/OAK-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167023#comment-15167023 ]

Alex Parvulescu commented on OAK-3362:
--------------------------------------

To provide an update on this issue, I ran into some problems implementing the 
GC estimation delta:
* What happens when there are multiple updates on the same node between the 2 
revisions? In the case of indexing (the lucene index, for example) the same 
binaries are updated over and over, creating garbage that will not be seen by 
simply diffing the 2 ends of the revision interval; we need to go fine-grained 
to really count all the garbage.
* Checkpoints! Creating at least one every 5 seconds makes for an incredibly 
expensive diff. I'm taking an approach where I try to run the content root diff 
separately and then investigate the checkpoint situation: basically ignore the 
added checkpoints and evaluate only the deleted ones, and that's how I ran into 
problem 1, listed above.
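To illustrate problem 1, here is a minimal sketch (not Oak's actual NodeState diff API; revisions are modeled as plain path-to-binary-size maps, and all names are hypothetical) showing how diffing only the two ends of the interval undercounts garbage when the same binary is rewritten on every commit:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EndpointVsIncremental {

    /** Garbage visible when diffing only two revisions (e.g. the interval ends). */
    static long endpointDiff(Map<String, Long> first, Map<String, Long> last) {
        long garbage = 0;
        for (Map.Entry<String, Long> e : first.entrySet()) {
            Long now = last.get(e.getKey());
            if (now == null || !now.equals(e.getValue())) {
                garbage += e.getValue(); // old binary is no longer reachable
            }
        }
        return garbage;
    }

    /** Fine-grained walk: diff every pair of consecutive revisions. */
    static long incrementalDiff(List<Map<String, Long>> revisions) {
        long garbage = 0;
        for (int i = 1; i < revisions.size(); i++) {
            garbage += endpointDiff(revisions.get(i - 1), revisions.get(i));
        }
        return garbage;
    }

    public static void main(String[] args) {
        // A lucene-index-like node whose binary is rewritten on every commit.
        List<Map<String, Long>> revs = new ArrayList<>();
        for (long size = 100; size <= 400; size += 100) {
            Map<String, Long> rev = new LinkedHashMap<>();
            rev.put("/oak:index/lucene", size);
            revs.add(rev);
        }
        // Endpoint diff sees only the first binary as garbage (100),
        // while the incremental walk counts every superseded copy (600).
        System.out.println(endpointDiff(revs.get(0), revs.get(revs.size() - 1)));
        System.out.println(incrementalDiff(revs));
    }
}
```

The endpoint diff reports 100 bytes of garbage, the incremental walk 600: the two intermediate copies of the binary are invisible at the interval ends.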

It looks more and more like this should be an incremental revision walkthrough: 
start at the reference revision and incrementally work up to the current head, 
looking at deleted content and counting up the garbage. This will be a lot 
simpler to implement (only content diffs looking at deleted/changed nodes), but 
at this point I'm wondering how expensive this will be compared to the current 
situation.
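The incremental walkthrough could be sketched roughly as below (a hypothetical outline, not Oak's real segment store API; revisions again modeled as path-to-size maps). It also shows the early-stop idea from the issue description: bail out as soon as the accumulated garbage crosses the gc threshold.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IncrementalGcEstimate {

    /**
     * Walk revision by revision from the reference rev up to head, summing
     * the size of deleted/changed binaries, and stop as soon as the
     * accumulated garbage crosses the gc threshold.
     */
    static long estimate(List<Map<String, Long>> revisions, long gcThreshold) {
        long garbage = 0;
        for (int i = 1; i < revisions.size(); i++) {
            Map<String, Long> before = revisions.get(i - 1);
            Map<String, Long> after = revisions.get(i);
            for (Map.Entry<String, Long> e : before.entrySet()) {
                Long now = after.get(e.getKey());
                if (now == null || !now.equals(e.getValue())) {
                    garbage += e.getValue(); // superseded binary is garbage
                    if (garbage >= gcThreshold) {
                        return garbage; // over threshold: stop the walk early
                    }
                }
            }
        }
        return garbage;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> revs = new ArrayList<>();
        for (long size = 100; size <= 400; size += 100) {
            Map<String, Long> rev = new LinkedHashMap<>();
            rev.put("/content/binary", size);
            revs.add(rev);
        }
        System.out.println(estimate(revs, 250));    // stops early at 300
        System.out.println(estimate(revs, 10_000)); // full walk: 600
    }
}
```

The open question from above still holds: the walk touches every revision in the interval, so its cost grows with the number of commits rather than with the size of the endpoint diff.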

> Estimate compaction based on diff to previous compacted head state
> ------------------------------------------------------------------
>
>                 Key: OAK-3362
>                 URL: https://issues.apache.org/jira/browse/OAK-3362
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: segmentmk
>            Reporter: Alex Parvulescu
>            Assignee: Alex Parvulescu
>            Priority: Minor
>              Labels: compaction, gc
>             Fix For: 1.6
>
>
> Food for thought: try to base the compaction estimation on a diff between the 
> latest compacted state and the current state.
> Pros
> * estimation duration would be proportional to the number of changes on the 
> current head state
> * using the size on disk as a reference, we could actually stop the 
> estimation early when we go over the gc threshold
> * data collected during this diff could in theory be passed as input to the 
> compactor so it could focus on compacting a specific subtree
> Cons
> * need to keep a reference to a previous compacted state; post-startup and 
> pre-compaction this might prove difficult (except maybe if we only persist 
> the revision, similar to what the async indexer is doing currently)
> * coming up with a threshold for running compaction might prove difficult
> * the diff might be costly, but still cheaper than the current full diff



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
