[ https://issues.apache.org/jira/browse/OAK-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Davide Giannella updated OAK-4200:
----------------------------------
Fix Version/s: (was: 1.5.4)
> [BlobGC] Improve collection times of blobs available
> ----------------------------------------------------
>
> Key: OAK-4200
> URL: https://issues.apache.org/jira/browse/OAK-4200
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Reporter: Amit Jain
> Assignee: Amit Jain
> Fix For: 1.6
>
>
> The blob collection phase (identifying all the blobs available in the data
> store) is quite an expensive part of the whole GC process, sometimes taking a
> few hours on large repositories, because it iterates over the sub-folders in
> the data store.
> In an offline discussion with [~tmueller] and [~chetanm], the idea came up
> that this phase can be faster if:
> * Blob ids are tracked when the blobs are added, e.g. in a simple file
> in the datastore per cluster node.
> * GC then consolidates this file from all the cluster nodes and uses it to
> get the candidates for GC.
> * This variant of the MarkSweepGC can be triggered more frequently. It would
> be ok to miss blob id additions to this file during a crash etc., as these
> blobs can be cleaned up in the *regular* MarkSweepGC cycles triggered
> occasionally.
> We may also be able to track other metadata along with the blob ids, such as
> paths and timestamps, for auditing/analytics, in conjunction with OAK-3140.
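
The tracking idea described in the issue could be sketched roughly as follows. This is only an illustration under stated assumptions, not Oak's implementation: the class name, the `blobids-<nodeId>.txt` file layout, and all method names are hypothetical, and the set of referenced blob ids is assumed to come from the usual mark phase.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: none of these names are Oak APIs.
class BlobIdTrackerSketch {

    // Assumed layout: one append-only text file per cluster node.
    static Path fileFor(Path dir, String nodeId) {
        return dir.resolve("blobids-" + nodeId + ".txt");
    }

    // Called when a blob is added: append its id to this node's file.
    static void track(Path dir, String nodeId, String blobId) throws IOException {
        Files.write(fileFor(dir, nodeId),
                (blobId + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // GC step 1: merge the per-node files into one de-duplicated, sorted set.
    static SortedSet<String> consolidate(Path dir) throws IOException {
        SortedSet<String> all = new TreeSet<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "blobids-*.txt")) {
            for (Path f : files) {
                all.addAll(Files.readAllLines(f, StandardCharsets.UTF_8));
            }
        }
        return all;
    }

    // GC step 2: tracked ids that are no longer referenced are GC candidates.
    static SortedSet<String> candidates(Path dir, Set<String> referenced) throws IOException {
        SortedSet<String> result = consolidate(dir);
        result.removeAll(referenced);
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("blobids");
        track(dir, "node1", "blob-a");
        track(dir, "node1", "blob-b");
        track(dir, "node2", "blob-b");
        track(dir, "node2", "blob-c");
        Set<String> referenced = new HashSet<>(Arrays.asList("blob-b"));
        System.out.println(candidates(dir, referenced)); // prints [blob-a, blob-c]
    }
}
```

An id that never reaches the file (for example after a crash) is simply absent from the candidate set, so that blob would only be picked up by the regular MarkSweepGC cycle, which matches the trade-off accepted in the description.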