[jira] [Updated] (OAK-4200) [BlobGC] Improve collection times of blobs available

Davide Giannella (JIRA) Thu, 16 Jun 2016 02:02:58 -0700

     [ 
https://issues.apache.org/jira/browse/OAK-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Davide Giannella updated OAK-4200:
----------------------------------
    Fix Version/s:     (was: 1.5.4)

> [BlobGC] Improve collection times of blobs available
> ----------------------------------------------------
>
>                 Key: OAK-4200
>                 URL: https://issues.apache.org/jira/browse/OAK-4200
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>            Reporter: Amit Jain
>            Assignee: Amit Jain
>             Fix For: 1.6
>
>
> The blob collection phase (identifying all the blobs available in the data 
> store) is an expensive part of the whole GC process, sometimes taking a few 
> hours on large repositories, because it iterates over the sub-folders in 
> the data store.
> In an offline discussion with [~tmueller] and [~chetanm], the idea came up 
> that this phase can be faster if
> * Blob ids are tracked when the blobs are added, e.g. in a simple file 
> in the data store per cluster node.
> * GC then consolidates this file from all the cluster nodes and uses it to 
> get the candidates for GC.
> * This variant of the MarkSweepGC can be triggered more frequently. It would 
> be ok to miss blob id additions to this file during a crash etc., as these 
> blobs can be cleaned up in the *regular* MarkSweepGC cycles triggered 
> occasionally.
> We may also be able to track other metadata along with the blob ids, such as 
> paths and timestamps, for auditing/analytics, in conjunction with OAK-3140.
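
The tracking idea in the description can be sketched roughly as follows. This is
only an illustration under stated assumptions, not Oak's actual implementation:
the class name BlobIdTrackerSketch, the "blobids-<node>.txt" file naming, and
the methods track, consolidate and gcCandidates are all hypothetical. Each
cluster node appends the ids of the blobs it adds to its own file, and GC merges
those files instead of walking the data store sub-folders.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of per-node blob id tracking; not part of Oak's API. */
public class BlobIdTrackerSketch {
    private final Path nodeFile; // one append-only file per cluster node (assumed layout)

    public BlobIdTrackerSketch(Path trackingDir, String clusterNodeId) throws IOException {
        Path dir = Files.createDirectories(trackingDir);
        this.nodeFile = dir.resolve("blobids-" + clusterNodeId + ".txt");
    }

    /** Called when a blob is added: append its id to this node's file. */
    public synchronized void track(String blobId) throws IOException {
        Files.write(nodeFile, Collections.singletonList(blobId), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** GC side: merge the files of all cluster nodes into one de-duplicated, sorted set. */
    public static Set<String> consolidate(Path trackingDir) throws IOException {
        Set<String> ids = new TreeSet<>();
        try (DirectoryStream<Path> files =
                     Files.newDirectoryStream(trackingDir, "blobids-*.txt")) {
            for (Path file : files) {
                ids.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            }
        }
        return ids;
    }

    /** GC candidates: tracked ids that the mark phase did not find referenced. */
    public static Set<String> gcCandidates(Set<String> tracked, Set<String> referenced) {
        Set<String> candidates = new TreeSet<>(tracked);
        candidates.removeAll(referenced);
        return candidates;
    }
}
```

An id that is lost during a crash simply never shows up in the consolidated set,
so the corresponding blob is skipped by this faster variant and left for a
regular MarkSweepGC run, as the description allows.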



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
