[ 
https://issues.apache.org/jira/browse/OAK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15978522#comment-15978522
 ] 

Chetan Mehrotra commented on OAK-2808:
--------------------------------------

h4. Minor BlobGC for lucene blobs

In this approach we:

# Record the ids (and some more metadata) of each deleted blob in a file on 
the filesystem, as part of async indexing. 
#* The file is stored in <index root dir>/deleted-blobs/blobs-<date>.txt
#* Each line has an entry for 
#** blobid
#** index path
# Have a periodic job (say 2-3 times a day) which: 
## Finds the time of the oldest checkpoint 
## Iterates over the file and treats all blobs deleted before that checkpoint 
time as candidates for deletion
## Deletes such blobs using the BlobStore API
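
The candidate selection in the periodic job could be sketched roughly as 
below. This is only an illustration, not a proposed patch: the class and 
method names are hypothetical, and the line layout 
{{blobId|indexPath|deletedAtMillis}} is an assumption (the list above only 
fixes blobid and index path; the deletion timestamp is assumed to be part of 
the "more metadata"). The actual deletion through the BlobStore API is left 
out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names exist in Oak.
public class MinorBlobGcSketch {

    /** One line of the deleted-blobs file (assumed layout, see lead-in). */
    static final class DeletedBlob {
        final String blobId;
        final String indexPath;
        final long deletedAtMillis;

        DeletedBlob(String blobId, String indexPath, long deletedAtMillis) {
            this.blobId = blobId;
            this.indexPath = indexPath;
            this.deletedAtMillis = deletedAtMillis;
        }
    }

    /** Parses "blobId|indexPath|deletedAtMillis" (assumed format). */
    static DeletedBlob parse(String line) {
        String[] parts = line.split("\\|", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Malformed entry: " + line);
        }
        return new DeletedBlob(parts[0], parts[1], Long.parseLong(parts[2].trim()));
    }

    /**
     * Returns the ids of blobs that were deleted before the oldest
     * checkpoint was created; only these are candidates for deletion.
     */
    static List<String> candidates(List<String> lines, long oldestCheckpointMillis) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // tolerate blank lines
            }
            DeletedBlob blob = parse(line);
            if (blob.deletedAtMillis < oldestCheckpointMillis) {
                result.add(blob.blobId);
            }
        }
        return result;
    }
}
```

Blobs deleted after the oldest checkpoint are skipped because that checkpoint 
may still refer to them; they would be picked up by a later run.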

This approach 
# Works on a best-effort basis. If the leader node where indexing was 
happening dies for some reason, the blobs recorded by it are not actively 
deleted and are taken care of by the normal blob GC process
# Relies on each blobId being unique. As we already append a random byte 
sequence, it is safe to delete such files without checking whether they are 
referred to again

h5. Potential Mark Phase

It may happen (in theory) that a deleted blob is resurrected by some repo 
operation which reverts the repo state. To handle such cases we can have 
an extra mark phase where we
# Go over all Lucene index definitions and mark the blobids 
# Then check whether any of the marked blobs appears in the deleted list. If 
it does, that blob is not deleted 
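
A minimal sketch of that check, assuming the marked blobids have already been 
collected into a set (how they are collected from the index definitions is 
not shown, and the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: not an existing Oak class.
public class MarkPhaseSketch {

    /**
     * Drops every deletion candidate that is still referred to by some
     * Lucene index, so a resurrected blob is never deleted.
     */
    static List<String> safeToDelete(List<String> deletedBlobIds,
                                     Set<String> markedBlobIds) {
        List<String> safe = new ArrayList<>();
        for (String id : deletedBlobIds) {
            if (!markedBlobIds.contains(id)) {
                safe.add(id);
            }
        }
        return safe;
    }
}
```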

[~amitj_76] [~catholicon] [~tmueller] Thoughts?


> Active deletion of 'deleted' Lucene index files from DataStore without 
> relying on full scale Blob GC
> ----------------------------------------------------------------------------------------------------
>
>                 Key: OAK-2808
>                 URL: https://issues.apache.org/jira/browse/OAK-2808
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Thomas Mueller
>              Labels: datastore, performance
>             Fix For: 1.8
>
>         Attachments: copyonread-stats.png, OAK-2808-1.patch
>
>
> With Lucene index files now stored in the DataStore, our usage pattern
> of the DataStore has changed between JR2 and Oak.
> With JR2 the writes were mostly application based, i.e. if an application
> stores a pdf/image file then that would be stored in the DataStore. JR2 by
> default would not write anything to the DataStore. Further, in deployments
> where a large amount of binary content is present, systems tend to
> share the DataStore to avoid duplication of storage. In such cases
> running Blob GC is a non-trivial task as it involves a manual step and
> coordination across multiple deployments. Due to this, systems tend to
> run GC less frequently.
> Now with Oak, apart from the application, the Oak system itself *actively*
> uses the DataStore to store the index files for Lucene, and there the
> churn might be much higher, i.e. the frequency of creation and deletion of
> index files is a lot higher. This would accelerate the rate of garbage
> generation and thus put a lot more pressure on the DataStore storage
> requirements.
> Discussion thread http://markmail.org/thread/iybd3eq2bh372zrl



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
