[ https://issues.apache.org/jira/browse/OAK-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169164#comment-15169164 ]

Alex Parvulescu commented on OAK-3362:
--------------------------------------

An update on the GC Deltas, and a new issue I ran into.

* GC Deltas: as it turns out, if you consider the checkpoints as snapshots in 
time from the gc reference up to the current head, ordered by creation time, 
you no longer need incremental diffs between all revisions; you can just diff 
by intervals and get a close enough estimation of garbage. 
To further explain the point: given [ref, cp0, cp1, head], where _ref_ is 
the revision where compaction last ran, _cp0_ and _cp1_ are the removed 
checkpoints (we're effectively ignoring added checkpoints), and _head_ is the 
current head state, diffing only [ref, head] can miss some intermediary 
updates on the same path (think indexing). A much better estimation of 
garbage is simply splitting the large diff over intervals: 
diff[ref, cp0] + diff[cp0, cp1] + diff[cp1, head]. It is still an estimation, 
but I think it is good enough.
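The interval-splitting idea can be sketched with a toy model (hypothetical code, not the real Oak SegmentNodeState API: snapshots are modeled as path-to-value maps, and diff just counts paths that changed between two snapshots):

```java
import java.util.List;
import java.util.Map;

// Toy model of the interval-based garbage estimation. Each "revision"
// is a snapshot mapping content paths to values; diff counts paths that
// were added or modified between two snapshots.
public class GcDeltaEstimate {

    static int diff(Map<String, Integer> from, Map<String, Integer> to) {
        int changed = 0;
        for (Map.Entry<String, Integer> e : to.entrySet()) {
            // counts added and modified paths; removals are ignored here
            if (!e.getValue().equals(from.get(e.getKey()))) {
                changed++;
            }
        }
        return changed;
    }

    // Sum the diffs over consecutive intervals, i.e.
    // diff[ref, cp0] + diff[cp0, cp1] + diff[cp1, head].
    static int estimateByIntervals(List<Map<String, Integer>> snapshots) {
        int total = 0;
        for (int i = 1; i < snapshots.size(); i++) {
            total += diff(snapshots.get(i - 1), snapshots.get(i));
        }
        return total;
    }

    public static void main(String[] args) {
        // the same path updated in every interval (think indexing)
        Map<String, Integer> ref  = Map.of("/index", 0);
        Map<String, Integer> cp0  = Map.of("/index", 1);
        Map<String, Integer> cp1  = Map.of("/index", 2);
        Map<String, Integer> head = Map.of("/index", 3);

        // the single large diff [ref, head] sees only one change...
        System.out.println(diff(ref, head));                                   // 1
        // ...while the interval sum also counts the intermediary updates
        System.out.println(estimateByIntervals(List.of(ref, cp0, cp1, head))); // 3
    }
}
```

The single diff undercounts because the intermediary values of "/index" (the garbage the intermediary updates left behind) are invisible between the two endpoints; the interval sum picks them up.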

* The issue that comes up next is what happens when the _ref_ state represents 
a compaction run that was not efficient, meaning there's still garbage left 
(lots of in-memory references that can't be cleared and such). In this case the 
delta will only estimate garbage accumulated _since_ that revision, so it 
might not reflect the actual state very well. I can't yet tell if this will be 
a problem in real life or not.

> Estimate compaction based on diff to previous compacted head state
> ------------------------------------------------------------------
>
>                 Key: OAK-3362
>                 URL: https://issues.apache.org/jira/browse/OAK-3362
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: segmentmk
>            Reporter: Alex Parvulescu
>            Assignee: Alex Parvulescu
>            Priority: Minor
>              Labels: compaction, gc
>             Fix For: 1.6
>
>
> Food for thought: try to base the compaction estimation on a diff between the 
> latest compacted state and the current state.
> Pros
> * estimation duration would be proportional to number of changes on the 
> current head state
> * using the size on disk as a reference, we could actually stop the 
> estimation early when we go over the gc threshold.
> * data collected during this diff could in theory be passed as input to the 
> compactor so it could focus on compacting a specific subtree
> Cons
> * need to keep a reference to a previous compacted state. post-startup and 
> pre-compaction this might prove difficult (except maybe if we only persist 
> the revision similar to what the async indexer is doing currently)
> * coming up with a threshold for running compaction might prove difficult
> * diff might be costly, but still cheaper than the current full diff



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)