[
https://issues.apache.org/jira/browse/OAK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15994362#comment-15994362
]
Chetan Mehrotra commented on OAK-2808:
--------------------------------------
bq. If blobs are removed after 30 minutes, and then one tries to rollback the
repository state to before that (for the segment store, truncate the journal to
what it was 31 minutes ago), the index would become corrupt.
Yes, that would be the case, but the only guarantee we make about repository
consistency at an older revision is for state that is accessed via a
checkpoint. Any state beyond the last valid checkpoint may not be consistent.
If this is required, then this feature can be disabled on those setups.
bq. A safer alternative to active deletion is to run revision garbage
collection / segment store compaction, followed by datastore garbage collection.
That's the current state. However, the current cycle is set to weekly. Given
that DataStore GC for a large repository takes time (the mark phase on
DocumentNodeStore especially, and deletion in S3 is slow), it's not feasible
to run them frequently. Maybe for SegmentNodeStore with online GC we can run
it frequently, but that would need to be validated.
bq. Basically, run the regular index update (which stores binaries in the
datastore) much less frequently, for example just once per 5 minutes
That would be tricky. If we make indexing less frequent, then each cycle's diff
would take longer, which can make that indexing cycle run much longer and in
turn cause the next cycle to take more time. We have seen this in a few setups:
once indexing starts lagging behind and the system is in active use, there is a
high chance of the indexing lag becoming larger and larger (/cc
[~stefan.eissing])
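To make the lag argument concrete, here is a rough toy model, not a
measurement from any real setup. It assumes (my assumption, purely for
illustration) that a cycle's duration is a fixed overhead plus an amount of
work proportional to the time window of changes it has to diff. The names
interval_s, overhead_s and work_ratio are made up for this sketch and are not
Oak settings.

```python
def simulate_cycles(interval_s, overhead_s, work_ratio, cycles):
    """Return the duration in seconds of each simulated indexing cycle.

    interval_s: scheduled gap between cycle starts (hypothetical).
    overhead_s: fixed cost per cycle (hypothetical).
    work_ratio: assumed seconds of indexing work per second of
                accumulated repository changes (hypothetical).
    """
    durations = []
    # The first cycle diffs exactly one scheduled interval of changes.
    window = interval_s
    for _ in range(cycles):
        duration = overhead_s + work_ratio * window
        durations.append(duration)
        # If the cycle overran its slot, the next one starts late and has
        # to diff everything that accumulated while this one was running.
        window = max(interval_s, duration)
    return durations


if __name__ == "__main__":
    # Lightly loaded: every cycle fits in its slot, durations stay flat.
    print(simulate_cycles(5, 1, 0.5, 5))
    # Same load, longer interval: flat again, but each cycle is far longer.
    print(simulate_cycles(300, 1, 0.5, 5))
    # Active use: each cycle overruns, so the next one has more to diff.
    print(simulate_cycles(5, 1, 1.2, 5))
```

In this model, once a cycle overruns its slot the window keeps growing for as
long as work_ratio stays at or above 1, which is the "larger and larger"
behaviour described above. A longer interval does not change that ratio, it
only makes every individual cycle bigger.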
> Active deletion of 'deleted' Lucene index files from DataStore without
> relying on full scale Blob GC
> ----------------------------------------------------------------------------------------------------
>
> Key: OAK-2808
> URL: https://issues.apache.org/jira/browse/OAK-2808
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: lucene
> Reporter: Chetan Mehrotra
> Assignee: Thomas Mueller
> Labels: datastore, performance
> Fix For: 1.8
>
> Attachments: copyonread-stats.png, OAK-2808-1.patch
>
>
> With Lucene index files now stored in the DataStore, our usage pattern
> of the DataStore has changed between JR2 and Oak.
> With JR2 the writes were mostly application based, i.e. if an application
> stores a pdf/image file then that would be stored in the DataStore. JR2 by
> default would not write stuff to the DataStore. Further, in deployments
> where a large amount of binary content is present, systems tend to
> share the DataStore to avoid duplication of storage. In such cases
> running Blob GC is a non-trivial task as it involves a manual step and
> coordination across multiple deployments. Due to this, systems tend to
> reduce the frequency of GC.
> Now with Oak, apart from the application, the Oak system itself *actively*
> uses the DataStore to store the index files for Lucene, and there the
> churn might be much higher, i.e. the frequency of creation and deletion of
> index files is a lot higher. This would accelerate the rate of garbage
> generation and thus put a lot more pressure on the DataStore storage
> requirements.
> Discussion thread http://markmail.org/thread/iybd3eq2bh372zrl