[jira] [Issue Comment Deleted] (OAK-4200) [BlobGC] Improve collection times of blobs available

Amit Jain (JIRA) Thu, 16 Jun 2016 01:34:34 -0700

     [ https://issues.apache.org/jira/browse/OAK-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Amit Jain updated OAK-4200:
---------------------------
    Comment: was deleted

(was: Thanks [~tmueller] for the feedback.

bq. ConcurrentLinkedQueue<String> exceptionQueue: not sure why the concurrent 
variant is used.
Earlier, delete requests were issued concurrently. This was changed to a single 
request after concurrent deletes were found to cause connection starvation on 
mongo. So the concurrent variant is no longer needed and can be changed.

bq. CloseableFileIterator: "If the underlying file is provide then it deletes 
the file on close". That sounds strange. Why would one delete a file once 
iterating is done? At least the name of the class should indicate that 
("BurnAfterReadingLineIterator" or so).
The iterator can be returned over a potentially large temporary file, so it's 
better if the file is deleted on close, as it would otherwise stick around for 
some time. The point of providing the file here is to indicate that it's no 
longer needed. So, I can change the name to better reflect this.
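
For illustration, a minimal sketch of a line iterator that removes its backing 
file on close. The class name DeleteOnCloseLineIterator and the deleteOnClose 
flag are hypothetical and are not taken from the Oak code base. The sketch only 
assumes that the temporary file holds one id per line.

```java
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical name; a sketch, not the actual Oak CloseableFileIterator.
class DeleteOnCloseLineIterator implements Iterator<String>, Closeable {
    private final BufferedReader reader;
    private final File fileToDelete; // null means: keep the file on close
    private String nextLine;

    DeleteOnCloseLineIterator(File file, boolean deleteOnClose) throws IOException {
        this.reader = Files.newBufferedReader(file.toPath(), StandardCharsets.UTF_8);
        this.fileToDelete = deleteOnClose ? file : null;
        this.nextLine = reader.readLine(); // read ahead so hasNext() is cheap
    }

    @Override
    public boolean hasNext() {
        return nextLine != null;
    }

    @Override
    public String next() {
        if (nextLine == null) {
            throw new NoSuchElementException();
        }
        String current = nextLine;
        try {
            nextLine = reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return current;
    }

    @Override
    public void close() throws IOException {
        try {
            reader.close();
        } finally {
            // The caller signalled that the temporary file is no longer needed.
            if (fileToDelete != null) {
                Files.deleteIfExists(fileToDelete.toPath());
            }
        }
    }
}
```

Passing the flag explicitly, rather than inferring deletion from the presence 
of a file, is one way to make the behaviour visible at the call site.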

bq. "if (idsIter instanceof Closeable)": using instanceof sounds strange. I 
understand this is done in "finally" so that the file is closed in exceptional 
cases as well, but maybe just let the Java GC deal with closing the resources 
for those cases?
The idea was to clear resources as fast as possible. One possible case is the 
iterator returned as above (CloseableFileIterator).
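
A minimal sketch of that cleanup pattern, assuming the caller only knows the 
ids as a plain Iterator. IteratorCleanup, countAndClose and TrackedIterator are 
made-up names for this example, not Oak APIs.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;

// Hypothetical helper; shows closing an iterator eagerly in a finally block.
class IteratorCleanup {
    static int countAndClose(Iterator<String> idsIter) throws IOException {
        int count = 0;
        try {
            while (idsIter.hasNext()) {
                idsIter.next();
                count++;
            }
            return count;
        } finally {
            // Release file handles right away (also on exceptions)
            // instead of waiting for garbage collection.
            if (idsIter instanceof Closeable) {
                ((Closeable) idsIter).close();
            }
        }
    }
}

// Hypothetical closeable iterator, used only to exercise the helper above.
class TrackedIterator implements Iterator<String>, Closeable {
    private final Iterator<String> delegate;
    boolean closed;

    TrackedIterator(Iterator<String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public String next() {
        return delegate.next();
    }

    @Override
    public void close() {
        closed = true;
    }
}
```

If the method's parameter type were itself Closeable, a try-with-resources 
block would avoid the instanceof check.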

{quote}
BlobCollectionType: not sure why this is an enum, I would probably use classes 
instead.
Not sure what the differences are between "DEFAULT" and "TRACKER". Should be 
documented.
{quote}
In hindsight the name should be a noun (e.g. BlobCollector). I thought an enum 
would be a little less cluttered, as it represents two related types with some 
specialized behavior.
Yes, I'll document it more clearly. {{DEFAULT}} & {{TRACKER}} distinguish the 
cases where the data store does or does not track ids locally.)
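
As a rough sketch of the enum approach discussed in the deleted comment, the 
following uses constant-specific method bodies so that each constant carries 
its own behavior. BlobIdSource, its two methods, collect and forStore are 
hypothetical names, and the assumption that TRACKER reads locally tracked ids 
while DEFAULT lists the data store is an inference from the comment, not 
confirmed by it.

```java
import java.util.Iterator;

// Hypothetical stand-in for a data store; not an Oak interface.
interface BlobIdSource {
    Iterator<String> listAllIds();   // full (expensive) listing of the store
    Iterator<String> trackedIds();   // ids recorded locally when blobs are added
}

// Two related collection strategies expressed as enum constants.
enum BlobCollector {
    DEFAULT {
        @Override
        Iterator<String> collect(BlobIdSource source) {
            return source.listAllIds();
        }
    },
    TRACKER {
        @Override
        Iterator<String> collect(BlobIdSource source) {
            return source.trackedIds();
        }
    };

    abstract Iterator<String> collect(BlobIdSource source);

    // Assumed selection rule: use TRACKER only when ids are tracked locally.
    static BlobCollector forStore(boolean tracksIdsLocally) {
        return tracksIdsLocally ? TRACKER : DEFAULT;
    }
}
```

The same shape could be written as two classes implementing a common 
interface, which is the alternative suggested in the feedback.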

> [BlobGC] Improve collection times of blobs available
> ----------------------------------------------------
>
>                 Key: OAK-4200
>                 URL: https://issues.apache.org/jira/browse/OAK-4200
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>            Reporter: Amit Jain
>            Assignee: Amit Jain
>             Fix For: 1.6, 1.5.4
>
>
> The blob collection phase (identifying all the blobs available in the data 
> store) is quite an expensive part of the whole GC process, sometimes taking a 
> few hours on large repositories, due to iteration over the sub-folders in 
> the data store.
> In an offline discussion with [~tmueller] and [~chetanm], the idea came up 
> that this phase can be faster if
> * Blob ids are tracked when the blobs are added, e.g. in a simple file 
> in the data store per cluster node.
> * GC then consolidates this file from all the cluster nodes and uses it to 
> get the candidates for GC.
> * This variant of the MarkSweepGC can be triggered more frequently. It would 
> be ok to miss blob id additions to this file during a crash etc., as these 
> blobs can be cleaned up in the *regular* MarkSweepGC cycles triggered 
> occasionally.
> We also may be able to track other metadata along with the blob ids, like 
> paths, timestamps etc., for auditing/analytics, in conjunction with OAK-3140.
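
The tracking idea in the description above can be sketched roughly as follows. 
This is only an illustration under stated assumptions: the class name 
BlobIdTrackerSketch, the "blobids-<node>.txt" file naming and the 
one-id-per-line format are all made up here and do not describe the eventual 
Oak implementation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: one append-only id file per cluster node.
class BlobIdTrackerSketch {
    private final Path trackingFile;

    BlobIdTrackerSketch(Path trackingDir, String clusterNodeId) {
        // Assumed naming scheme, chosen only for this example.
        this.trackingFile = trackingDir.resolve("blobids-" + clusterNodeId + ".txt");
    }

    // Called when a blob is added: append its id as one line.
    void add(String blobId) throws IOException {
        Files.write(trackingFile, Collections.singletonList(blobId),
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Called by GC: merge the files of all cluster nodes into one sorted,
    // de-duplicated set of known blob ids.
    static Set<String> consolidate(Path trackingDir) throws IOException {
        Set<String> allIds = new TreeSet<>();
        try (DirectoryStream<Path> files =
                 Files.newDirectoryStream(trackingDir, "blobids-*.txt")) {
            for (Path file : files) {
                allIds.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            }
        }
        return allIds;
    }
}
```

Ids lost in a crash before the append completes would simply be absent from 
the consolidated set, which matches the description's point that the regular 
MarkSweepGC cycle can pick those blobs up later.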



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
