[
https://issues.apache.org/jira/browse/OAK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020559#comment-16020559
]
Vikas Saurabh commented on OAK-2808:
------------------------------------
I've done some preliminary work here -
https://github.com/catholicon/jackrabbit-oak/commits/OAK-2808-active-lucene-binary-deletion.
[~chetanm], the initial state that you reviewed earlier is at
https://github.com/catholicon/jackrabbit-oak/commits/OAK-2808-active-lucene-binary-deletion-take1.
Apart from the executor service, I think I've incorporated all the changes you
mentioned.
Things that need improvement:
* Use an executor for the file flush, and subsequently make some changes if we
can flush more quickly
* Missing tests and javadocs
* Refactoring/cleanup/renames?
From a feature POV, the following pieces are still missing:
* API to get oldest safe timestamp from checkpoint (OAK-6227)
* Hook up the purge-blobs call to some scheduled task
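To make the two missing pieces concrete, here is a minimal sketch of how a purge call could use an oldest safe timestamp. It is not the code on the branch: DeletedBlobTracker, track, purgeBlobsDeletedBefore and the deleter callback are all hypothetical names, and the rule that only blobs deleted strictly before the oldest checkpoint are safe to remove is an assumption.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; none of these names come from Oak or from the branch.
final class DeletedBlobTracker {

    /** A blob id paired with the time (in ms) at which its index file was deleted. */
    static final class Entry {
        final String blobId;
        final long deletedAt;

        Entry(String blobId, long deletedAt) {
            this.blobId = blobId;
            this.deletedAt = deletedAt;
        }
    }

    private final List<Entry> pending = new ArrayList<>();

    /** Records that the index file backed by blobId was deleted at deletedAt. */
    synchronized void track(String blobId, long deletedAt) {
        pending.add(new Entry(blobId, deletedAt));
    }

    /**
     * Purges tracked blobs deleted strictly before oldestSafeTimestamp.
     * Assumption: a checkpoint at or before the deletion time may still
     * reference the file, so newer deletions stay pending.
     * The deleter returns true when the blob was actually removed; entries
     * whose deletion fails stay pending and are retried on the next run.
     */
    synchronized int purgeBlobsDeletedBefore(long oldestSafeTimestamp,
                                             Predicate<String> deleter) {
        int purged = 0;
        for (Iterator<Entry> it = pending.iterator(); it.hasNext();) {
            Entry e = it.next();
            if (e.deletedAt < oldestSafeTimestamp && deleter.test(e.blobId)) {
                it.remove();
                purged++;
            }
        }
        return purged;
    }

    /** Number of deleted blobs still waiting to be purged. */
    synchronized int pendingCount() {
        return pending.size();
    }
}
```

A scheduled task could then call purgeBlobsDeletedBefore periodically, passing whatever timestamp the checkpoint API ends up returning.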
[~tmueller], could you please take a look for an early review of the direction?
The current state is a little rough and needs cleanup, but it should give you an
idea of how I'm implementing it.
PS: While the commits I've broken this into are mostly distinct, their usage
seems too cohesive to be split into separate issues. That said, I'm fairly OK
either way with respect to multiple issues/sub-tasks.
> Active deletion of 'deleted' Lucene index files from DataStore without
> relying on full scale Blob GC
> ----------------------------------------------------------------------------------------------------
>
> Key: OAK-2808
> URL: https://issues.apache.org/jira/browse/OAK-2808
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: lucene
> Reporter: Chetan Mehrotra
> Assignee: Thomas Mueller
> Labels: datastore, performance
> Fix For: 1.8
>
> Attachments: copyonread-stats.png, OAK-2808-1.patch
>
>
> With Lucene index files now stored in the DataStore, our usage pattern
> of the DataStore has changed between JR2 and Oak.
> With JR2 the writes were mostly application based, i.e. if an application
> stores a pdf/image file then that would be stored in the DataStore. JR2 by
> default would not write anything to the DataStore. Further, in deployments
> where a large amount of binary content is present, systems tend to
> share the DataStore to avoid duplication of storage. In such cases
> running Blob GC is a non-trivial task as it involves a manual step and
> coordination across multiple deployments. Because of this, systems tend to
> reduce the frequency of GC.
> Now with Oak, apart from the application, the Oak system itself *actively*
> uses the DataStore to store the index files for Lucene, and there the
> churn might be much higher, i.e. the frequency of creation and deletion of
> index files is a lot higher. This would accelerate the rate of garbage
> generation and thus put a lot more pressure on the DataStore storage
> requirements.
> Discussion thread http://markmail.org/thread/iybd3eq2bh372zrl
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)