[jira] [Commented] (OAK-2808) Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

Chetan Mehrotra (JIRA) Tue, 02 May 2017 23:53:16 -0700

    [ 
https://issues.apache.org/jira/browse/OAK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15994362#comment-15994362
 ]


Chetan Mehrotra commented on OAK-2808:
--------------------------------------

bq. If blobs are removed after 30 minutes, and then one tries to roll back the 
repository state to before that (for the segment store, truncate the journal to 
what it was 31 minutes ago), the index would become corrupt.

Yes, that would be the case. However, the only guarantee we make about 
repository consistency at an older revision is for state accessed via a 
checkpoint. Any state beyond the last valid checkpoint may not be consistent. 
If such rollbacks are required, this feature can be disabled on those setups.
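
To make the rollback concern concrete, here is a minimal sketch of 
age-based purging and of which blobs a rolled-back state would miss. It is 
an illustration only, not the patch attached to this issue: DeletedBlob, 
split_purgeable, missing_after_rollback and all timestamps are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeletedBlob:
    blob_id: str       # hypothetical id of a Lucene index file blob
    deleted_at: float  # time (seconds) at which the index dropped the file


def split_purgeable(deleted, now, min_age_seconds):
    """Return (purge, keep): blobs deleted at least min_age_seconds ago are purged."""
    purge = [b for b in deleted if now - b.deleted_at >= min_age_seconds]
    keep = [b for b in deleted if now - b.deleted_at < min_age_seconds]
    return purge, keep


def missing_after_rollback(purged, rollback_to):
    """Purged blobs that the repository state at time rollback_to still referenced."""
    # A blob deleted after rollback_to was still part of the index at rollback_to.
    return [b for b in purged if b.deleted_at > rollback_to]


blobs = [DeletedBlob("a", 1000.0), DeletedBlob("b", 3000.0)]
purge, keep = split_purgeable(blobs, now=3600.0, min_age_seconds=1800.0)
# Rolling back to t=900 needs blob "a", which has already been purged.
print([b.blob_id for b in missing_after_rollback(purge, rollback_to=900.0)])
```

In this sketch a rollback to a point after a purged blob's deletion time 
misses nothing, while a rollback to before it finds the index referring to 
a blob that no longer exists.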

bq. A safer alternative to active deletion is to run revision garbage 
collection / segment store compaction, followed by datastore garbage collection.

That's the current state. However, the current cycle is weekly. Given that 
DataStore GC on a large repository takes time (the mark phase on 
DocumentNodeStore especially, and deletion in S3 is slow), it's not feasible 
to run it frequently. Maybe for SegmentNodeStore with online GC we can run it 
frequently, but that would need to be validated.

bq. Basically, run the regular index update (which stores binaries in the 
datastore) much less frequently, for example just once per 5 minutes

That would be tricky. If we make indexing less frequent, each cycle's diff 
would take longer, which can make that indexing cycle take much longer and in 
turn cause the next cycle to take more time. We have seen this in a few 
setups: once indexing starts lagging behind while the system is in active 
use, there is a high chance of the indexing lag becoming larger and larger 
(/cc [~stefan.eissing])
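
The feedback loop can be sketched with a toy model in which every cycle has 
to process the changes that accumulated since the previous cycle started. 
The function name, the constant rates and the linear cost are all 
assumptions for illustration, not measurements from Oak.

```python
def simulate_lag(write_rate, index_rate, interval, overhead, cycles):
    """Toy model: return the duration in seconds of each indexing cycle.

    write_rate -- changes committed per second (assumed constant)
    index_rate -- changes the indexer processes per second (assumed constant)
    interval   -- configured seconds between cycle starts
    overhead   -- fixed per-cycle cost in seconds
    """
    durations = []
    backlog = write_rate * interval  # changes accumulated before the first cycle
    for _ in range(cycles):
        duration = overhead + backlog / index_rate
        durations.append(duration)
        # The next cycle starts once the interval has elapsed or the current
        # cycle has finished, whichever is later.
        gap = max(interval, duration)
        backlog = write_rate * gap
    return durations


# Light load: every cycle finishes well within the interval.
print(simulate_lag(100, 1000, 5, 1, 3))  # [1.5, 1.5, 1.5]
# Writes outpace the indexer: each cycle takes longer than the one before.
print(simulate_lag(1200, 1000, 5, 1, 5))
```

Under these assumptions, once a cycle overruns its interval the backlog for 
the next cycle is set by the overrun rather than by the configured interval, 
which is the compounding effect described above.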

> Active deletion of 'deleted' Lucene index files from DataStore without 
> relying on full scale Blob GC
> ----------------------------------------------------------------------------------------------------
>
>                 Key: OAK-2808
>                 URL: https://issues.apache.org/jira/browse/OAK-2808
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Thomas Mueller
>              Labels: datastore, performance
>             Fix For: 1.8
>
>         Attachments: copyonread-stats.png, OAK-2808-1.patch
>
>
> Storing Lucene index files in the DataStore has changed our DataStore
> usage pattern between JR2 and Oak.
> With JR2, writes were mostly application driven, i.e. if the application
> stored a PDF or image file, it would be stored in the DataStore. JR2
> itself would not write to the DataStore by default. Further, in
> deployments with a large amount of binary content, systems tend to
> share the DataStore to avoid duplicating storage. In such cases,
> running Blob GC is a non-trivial task, as it involves a manual step and
> coordination across multiple deployments. Because of this, systems tend
> to run GC less frequently.
> Now with Oak, apart from the application, the Oak system itself *actively*
> uses the DataStore to store the Lucene index files, and there the
> churn might be much higher, i.e. index files are created and deleted
> far more frequently. This accelerates the rate of garbage
> generation and thus puts a lot more pressure on DataStore storage
> requirements.
> Discussion thread http://markmail.org/thread/iybd3eq2bh372zrl



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
