[ 
https://issues.apache.org/jira/browse/OAK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15994345#comment-15994345
 ] 

Thomas Mueller commented on OAK-2808:
-------------------------------------

I'm not sure if that's a big problem, but it is a limitation: if blobs are 
removed after 30 minutes, and one then tries to roll back the repository state 
to a point before that (for the segment store, by truncating the journal to 
what it was 31 minutes ago), the index would become corrupt, because it would 
again reference blobs that have already been deleted.
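
To make the window concrete, here is a minimal sketch in Java. It is not Oak 
code: the class RollbackWindowCheck and the method isRollbackSafe are 
hypothetical names, the timestamp is arbitrary, and the sketch assumes that 
every blob is purged a fixed retention period (30 minutes in the example 
above) after the index stops referencing it.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for illustration only; not part of Oak. */
final class RollbackWindowCheck {

    private RollbackWindowCheck() {}

    /**
     * Returns true if rolling back to targetRevisionTime cannot bring back
     * references to blobs that active deletion may already have purged.
     * Assumption: a blob is purged blobRetention after it became unreferenced.
     */
    static boolean isRollbackSafe(Instant now, Instant targetRevisionTime, Duration blobRetention) {
        // Blobs that became unreferenced at or before this instant may be gone.
        Instant oldestSafeTime = now.minus(blobRetention);
        return !targetRevisionTime.isBefore(oldestSafeTime);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2017-05-03T12:00:00Z"); // arbitrary example time
        Duration retention = Duration.ofMinutes(30);
        // Journal truncated to its state 31 minutes ago: outside the window.
        System.out.println(isRollbackSafe(now, now.minus(Duration.ofMinutes(31)), retention)); // false
        // Journal truncated to its state 10 minutes ago: inside the window.
        System.out.println(isRollbackSafe(now, now.minus(Duration.ofMinutes(10)), retention)); // true
    }
}
```

Such a check could at most warn before a rollback; it does not remove the 
limitation itself.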

A safer alternative to active deletion is to run revision garbage collection / 
segment store compaction, followed by datastore garbage collection.

A second safer alternative would be to upload fewer binaries to the datastore. 
Don't we have that already with NRT indexes (where each cluster node indexes 
the repository state itself)? If not, could we change NRT indexes so this is 
possible? Basically, run the regular index update (which stores binaries in the 
datastore) much less frequently, for example just once every 5 minutes, but let 
each cluster node update its local index (in the file system) more often, for 
example every 10 seconds.

> Active deletion of 'deleted' Lucene index files from DataStore without 
> relying on full scale Blob GC
> ----------------------------------------------------------------------------------------------------
>
>                 Key: OAK-2808
>                 URL: https://issues.apache.org/jira/browse/OAK-2808
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Thomas Mueller
>              Labels: datastore, performance
>             Fix For: 1.8
>
>         Attachments: copyonread-stats.png, OAK-2808-1.patch
>
>
> With storing of Lucene index files within DataStore our usage pattern
> of DataStore has changed between JR2 and Oak.
> With JR2 the writes were mostly application based, i.e. if an application
> stores a pdf/image file then that would be stored in DataStore. JR2 by
> default would not write stuff to DataStore. Further, in deployments
> where a large amount of binary content is present, systems tend to
> share the DataStore to avoid duplication of storage. In such cases
> running Blob GC is a non-trivial task as it involves a manual step and
> coordination across multiple deployments. Due to this, systems tend to
> reduce the frequency of GC.
> Now with Oak, apart from the application, the Oak system itself *actively*
> uses the DataStore to store the index files for Lucene, and there the
> churn might be much higher, i.e. the frequency of creation and deletion of
> index files is a lot higher. This would accelerate the rate of garbage
> generation and thus put a lot more pressure on the DataStore storage
> requirements.
> Discussion thread http://markmail.org/thread/iybd3eq2bh372zrl



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
