[
https://issues.apache.org/jira/browse/OAK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006029#comment-16006029
]
Thomas Mueller commented on OAK-2808:
-------------------------------------
[~chetanm] I'm afraid I don't understand your points.
> No change in current datastore GC is to be done
Well, we need to do active deletion in the datastore at least. So far I assumed
we keep references to binaries as blob ids, but now you wrote they need to be
stored as strings... If _all_ blob ids in the Lucene index are stored as
strings, then regular datastore GC will remove all Lucene binaries every time
it is run... So either _some_ of the Lucene blobs need to be stored as blob
references (in which case binaries will not be removed early), or we need to
change datastore GC so those latest Lucene binaries are not removed.
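
To make the concern above concrete, here is a minimal sketch of a mark-and-sweep collector that only treats typed blob references as "live". It is not Oak's actual API: the class `BlobGcSketch`, the `Property` type, and the blob ids are hypothetical names made up for illustration. It also ignores details such as timing and clustering. It only shows that a blob id kept as a plain string is invisible to the mark phase and so gets swept.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class BlobGcSketch {

    /** Hypothetical property value: a blob id held either as a typed reference or as a plain string. */
    static final class Property {
        final String value;
        final boolean isBlobReference;

        Property(String value, boolean isBlobReference) {
            this.value = value;
            this.isBlobReference = isBlobReference;
        }
    }

    /** Mark phase: only typed blob references count; strings that happen to contain a blob id are ignored. */
    static Set<String> mark(Collection<Property> properties) {
        Set<String> referenced = new HashSet<>();
        for (Property p : properties) {
            if (p.isBlobReference) {
                referenced.add(p.value);
            }
        }
        return referenced;
    }

    /** Sweep phase: remove every blob the mark phase did not see; returns the ids that were deleted. */
    static Set<String> sweep(Set<String> dataStore, Set<String> referenced) {
        Set<String> deleted = new TreeSet<>(dataStore);
        deleted.removeAll(referenced);
        dataStore.removeAll(deleted);
        return deleted;
    }

    public static void main(String[] args) {
        // Assumed example content: one application binary and two live Lucene index files.
        Set<String> dataStore = new TreeSet<>(List.of("app-pdf", "lucene-seg-1", "lucene-seg-2"));
        List<Property> properties = List.of(
                new Property("app-pdf", true),        // application binary, typed reference
                new Property("lucene-seg-1", false),  // index file id stored as a string
                new Property("lucene-seg-2", true));  // index file id stored as a blob reference

        Set<String> deleted = sweep(dataStore, mark(properties));
        System.out.println("deleted:   " + deleted);   // deleted:   [lucene-seg-1]
        System.out.println("remaining: " + dataStore); // remaining: [app-pdf, lucene-seg-2]
    }
}
```

In this sketch the index file whose id is stored as a string is deleted even though it is still in use, which is why either some Lucene blobs would have to stay typed references or the collector itself would have to change.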
> The problem there is complexity introduced in index logic to ensure that
> index remains consistent wrt repository state
I'm afraid I don't understand the problem you refer to in the above timeline of
events.
> NRT indexes do not index binary content
OK, what is the reason for this, and could we change it? If yes, then
according to my (maybe too simple) logic we could just reduce the frequency of
the regular index updates and rely on NRT indexes to cover the time between
them.
> Active deletion of 'deleted' Lucene index files from DataStore without
> relying on full scale Blob GC
> ----------------------------------------------------------------------------------------------------
>
> Key: OAK-2808
> URL: https://issues.apache.org/jira/browse/OAK-2808
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: lucene
> Reporter: Chetan Mehrotra
> Assignee: Thomas Mueller
> Labels: datastore, performance
> Fix For: 1.8
>
> Attachments: copyonread-stats.png, OAK-2808-1.patch
>
>
> With Lucene index files now stored in the DataStore, our usage pattern
> of the DataStore has changed between JR2 and Oak.
> With JR2 the writes were mostly application-driven, i.e. if an application
> stores a pdf/image file then that would be stored in the DataStore. JR2 by
> default would not write anything to the DataStore itself. Further, in
> deployments with a large amount of binary content, systems tend to
> share the DataStore to avoid duplicating storage. In such cases
> running Blob GC is a non-trivial task, as it involves a manual step and
> coordination across multiple deployments. Because of this, systems tend to
> run GC less frequently.
> Now with Oak, apart from the application, the Oak system itself *actively*
> uses the DataStore to store the index files for Lucene, and there the
> churn might be much higher, i.e. the frequency of creation and deletion of
> index files is a lot higher. This would accelerate the rate of garbage
> generation and thus put a lot more pressure on the DataStore storage
> requirements.
> Discussion thread http://markmail.org/thread/iybd3eq2bh372zrl