[
https://issues.apache.org/jira/browse/OAK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004750#comment-16004750
]
Thomas Mueller commented on OAK-2808:
-------------------------------------
> We would need to save them as strings of blobids.
Active deletion requires many workarounds:
* store blob ids as plain Strings instead of as typed blob references
* changes to the datastore GC are needed to _collect_ some blobs
* changes to the datastore GC are needed to _retain_ other blobs
* cannot support rollback to an old revision
We already have datastore GC; building "our own" mechanism to ensure binaries
are not collected at point x but are collected at point y, and so on, is
duplicating that work.
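To make the first two workarounds concrete, here is a minimal, illustrative Python model (not Oak code; `DataStore`, `Node`, and `BlobRef` are hypothetical stand-ins): a mark-and-sweep datastore GC only scans properties typed as blob references, so a blob id buried in an ordinary string property is invisible to the mark phase, and the GC would need changes to _retain_ such blobs:

```python
# Toy mark-and-sweep model. All names are hypothetical, not Oak APIs.

class BlobRef:
    """A typed blob reference, visible to the GC's mark phase."""
    def __init__(self, blob_id):
        self.blob_id = blob_id

class Node:
    def __init__(self, **properties):
        self.properties = properties

class DataStore:
    def __init__(self):
        self.blobs = {}  # blob_id -> bytes

    def put(self, blob_id, data):
        self.blobs[blob_id] = data

    def gc(self, nodes):
        # Mark: only typed blob references are followed; blob ids stored
        # as plain strings are not recognized as references.
        referenced = set()
        for node in nodes:
            for value in node.properties.values():
                if isinstance(value, BlobRef):
                    referenced.add(value.blob_id)
        # Sweep: everything unmarked is deleted.
        for blob_id in list(self.blobs):
            if blob_id not in referenced:
                del self.blobs[blob_id]

store = DataStore()
store.put("b1", b"application binary")
store.put("b2", b"lucene index data")

nodes = [
    Node(file=BlobRef("b1")),   # normal typed blob reference: retained
    Node(blobIds="b2"),         # blob id hidden in a plain string: collected
]
store.gc(nodes)
print(sorted(store.blobs))     # only "b1" survives, though "b2" was needed
```

In this model the string-referenced blob is wrongly swept, which is exactly why the GC would need special-case knowledge of those string properties.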
If possible I would try to simplify the solution. If "only persist the index
(to the repository) every x minutes" can simplify it, then I think we should
try that instead. That would have the following benefits:
* possibly no need to disable the "write directly to S3 instead of using a
write-back cache"
* reduces writes to the datastore, which reduces S3 cost and write bandwidth
* reduces reads from the datastore
* completely avoids the risk of deleting required binaries in the datastore
(which is always a huge problem)
* no need to use all those "special" workarounds
* probably a less complicated solution than active deletion
But let's discuss that. [~chetanm] what disadvantages do you see for "only
persist the index (to the repository) every x minutes"?
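As a rough back-of-envelope sketch of the "persist every x minutes" proposal (not Oak code; the functions and numbers are invented for illustration): updates accumulate locally and only one persist per interval reaches the datastore, so the number of datastore writes drops by roughly the coalescing factor:

```python
# Hypothetical write-count model; parameters are made-up examples.

def direct_writes(updates_per_minute, hours):
    """Every index update is written straight to the datastore."""
    return updates_per_minute * 60 * hours

def batched_writes(persist_interval_minutes, hours):
    """Updates accumulate locally; one persist per interval writes them out."""
    return (60 * hours) // persist_interval_minutes

direct = direct_writes(10, 1)    # 10 updates/min, written through for 1 hour
batched = batched_writes(5, 1)   # persist every x = 5 minutes for 1 hour
print(direct, batched)           # 600 vs 12 datastore write cycles
```

With these example numbers, 600 write-through operations per hour collapse to 12 persist cycles, which is where the reduced S3 cost and bandwidth in the list above would come from.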
> Active deletion of 'deleted' Lucene index files from DataStore without
> relying on full scale Blob GC
> ----------------------------------------------------------------------------------------------------
>
> Key: OAK-2808
> URL: https://issues.apache.org/jira/browse/OAK-2808
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: lucene
> Reporter: Chetan Mehrotra
> Assignee: Thomas Mueller
> Labels: datastore, performance
> Fix For: 1.8
>
> Attachments: copyonread-stats.png, OAK-2808-1.patch
>
>
> With Lucene index files now stored in the DataStore, our usage pattern of
> the DataStore has changed between JR2 and Oak.
> With JR2 the writes were mostly application driven, i.e. if the application
> stored a pdf/image file then that would be stored in the DataStore; JR2 by
> default would not write its own data to the DataStore. Further, in
> deployments where a large amount of binary content is present, systems tend
> to share the DataStore to avoid duplicating storage. In such cases running
> Blob GC is a non-trivial task, as it involves a manual step and
> coordination across multiple deployments. Due to this, systems tend to run
> GC less frequently.
> Now with Oak, apart from the application, the Oak system itself *actively*
> uses the DataStore to store the Lucene index files, and there the churn can
> be much higher, i.e. index files are created and deleted far more
> frequently. This accelerates the rate of garbage generation and thus puts a
> lot more pressure on the DataStore storage requirements.
> Discussion thread http://markmail.org/thread/iybd3eq2bh372zrl
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)