[ 
https://issues.apache.org/jira/browse/OAK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16000109#comment-16000109
 ] 

Vikas Saurabh commented on OAK-2808:
------------------------------------

While working on a patch for this, I noticed that we already have 
some code to actively "note" deleted files via 
[r1705054|https://svn.apache.org/r1705054], which creates 
{{<oak_dir>/:trash/run_<N>}} nodes. The {{run_<N>}} entries also record the 
current system time. The code is deactivated by default and can be controlled 
by a property on the index definition node.

[~chetanm], the existing code, afaics, already handles the concern you had 
with storing on the repo. Maybe we should extend this rather than pursue the 
local file idea.

So, to me it seems that the part that ties this together is to also note 
completed executions of revisionGC/compaction, as that definitely marks a 
horizon of revisions that a repository can be rolled back to. A follow-up to 
rev gc could be another job that reads {{run_<N>}} entries and deletes 
binaries from the blob store if the stored timestamp is comfortably \[0] 
behind rev gc.

Afaics, {{RevisionGC}} can be hooked up to note execution completion, and the 
follow-up job should be fairly straightforward.
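
A minimal sketch of the selection step such a follow-up job might perform, in plain Java. All names here ({{TrashEntry}}, {{selectDeletable}}, {{safetyMargin}}) are hypothetical and are not Oak APIs; reading the {{run_<N>}} entries and the actual deletion from the blob store are left out. It only illustrates the rule described above: an entry qualifies when its stored timestamp is older than the last completed rev gc by at least some margin.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these types exist in Oak. */
final class ActiveDeletionSketch {

    /** One noted deletion: a blob id plus the time stored in its run_<N> entry. */
    static final class TrashEntry {
        final String blobId;
        final Instant notedAt;

        TrashEntry(String blobId, Instant notedAt) {
            this.blobId = blobId;
            this.notedAt = notedAt;
        }
    }

    /**
     * Returns the blob ids whose noted time is strictly before
     * (lastRevGcCompleted - safetyMargin). The margin absorbs clock
     * differences between the systems involved.
     */
    static List<String> selectDeletable(List<TrashEntry> entries,
                                        Instant lastRevGcCompleted,
                                        Duration safetyMargin) {
        Instant cutoff = lastRevGcCompleted.minus(safetyMargin);
        List<String> deletable = new ArrayList<>();
        for (TrashEntry entry : entries) {
            if (entry.notedAt.isBefore(cutoff)) {
                deletable.add(entry.blobId);
            }
        }
        return deletable;
    }
}
```

Entries noted after the cutoff are simply left for a later run, so a margin that is too generous only delays cleanup rather than risking deletion of a binary that a rollback could still need.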

[~tmueller], about:
bq. and only persist the index (to the repository) every x minutes
I feel that maybe we should try this approach first, as it's fairly simple, 
and if it takes load off the blob store (which it should, afaics) then we 
won't have to try the delayed-commit index updates (which might have their own 
can of worms in an actual implementation).

[~tmueller], [~chetanm] does this look ok to move forward?

\[0]:
There won't be a good way to keep clocks tightly in sync between the system 
that stores {{run_<N>}} entries or the rev gc timestamp and the system that 
runs active cleanup. So, I think there should be a comfortable margin.

> Active deletion of 'deleted' Lucene index files from DataStore without 
> relying on full scale Blob GC
> ----------------------------------------------------------------------------------------------------
>
>                 Key: OAK-2808
>                 URL: https://issues.apache.org/jira/browse/OAK-2808
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Thomas Mueller
>              Labels: datastore, performance
>             Fix For: 1.8
>
>         Attachments: copyonread-stats.png, OAK-2808-1.patch
>
>
> With storing of Lucene index files within DataStore our usage pattern
> of DataStore has changed between JR2 and Oak.
> With JR2 the writes were mostly application based i.e. if application
> stores a pdf/image file then that would be stored in DataStore. JR2 by
> default would not write stuff to DataStore. Further in deployment
> where large number of binary content is present then systems tend to
> share the DataStore to avoid duplication of storage. In such cases
> running Blob GC is a non trivial task as it involves a manual step and
> coordination across multiple deployments. Due to this systems tend to
> delay frequency of GC
> Now with Oak apart from application the Oak system itself *actively*
> uses the DataStore to store the index files for Lucene and there the
> churn might be much higher i.e. frequency of creation and deletion of
> index file is lot higher. This would accelerate the rate of garbage
> generation and thus put lot more pressure on the DataStore storage
> requirements.
> Discussion thread http://markmail.org/thread/iybd3eq2bh372zrl



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
