Hi,

I think removing binaries directly without going through the GC logic is
dangerous, because we can't be sure there are no other references. There
is one exception: if each file is guaranteed to be unique. For that,
we could for example append a unique UUID to each file. The Lucene file
system implementation would need to be changed for that (write the UUID,
but ignore it when reading the file and when reading the file size).
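
To illustrate, here is a minimal sketch of that footer idea in plain
Java (the class and method names are made up here; the real change would
go into the Lucene directory implementation used by Oak):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.util.UUID;

    import static java.nio.file.StandardOpenOption.*;

    // Sketch of the "unique file" idea: every file gets a random
    // 16-byte UUID footer on write, so no two files are ever
    // byte-identical, and readers hide the footer (ignore it when
    // reading, and subtract it from the reported file size).
    public final class UniqueFooterFiles {

        static final int FOOTER = 16; // two longs of a random UUID

        static void write(Path file, byte[] data) throws IOException {
            UUID u = UUID.randomUUID();
            ByteBuffer footer = ByteBuffer.allocate(FOOTER);
            footer.putLong(u.getMostSignificantBits());
            footer.putLong(u.getLeastSignificantBits());
            footer.flip();
            try (FileChannel ch = FileChannel.open(file,
                    CREATE, WRITE, TRUNCATE_EXISTING)) {
                ch.write(ByteBuffer.wrap(data));
                ch.write(footer);
            }
        }

        // Logical size as seen by readers: physical size minus footer.
        static long length(Path file) throws IOException {
            try (FileChannel ch = FileChannel.open(file, READ)) {
                return ch.size() - FOOTER;
            }
        }

        // Read only the logical content; the footer is never exposed.
        static byte[] read(Path file) throws IOException {
            try (FileChannel ch = FileChannel.open(file, READ)) {
                ByteBuffer buf =
                        ByteBuffer.allocate((int) (ch.size() - FOOTER));
                while (buf.hasRemaining() && ch.read(buf) >= 0) { }
                return buf.array();
            }
        }
    }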

Even in that case, there is still a risk, for example if the binary
_reference_ is copied (so the supposedly unique binary gains a second
referrer), or if an old revision that still points to the file is
accessed. How do we ensure this does not happen?

Regards,
Thomas


On 10/03/15 07:46, "Chetan Mehrotra" <[email protected]> wrote:

>Hi Team,
>
>With the storing of Lucene index files within the DataStore, our usage
>pattern of the DataStore has changed between JR2 and Oak.
>
>With JR2 the writes were mostly application driven, i.e. if the
>application stored a pdf/image file then that would be stored in the
>DataStore; JR2 itself would not by default write anything to the
>DataStore. Further, in deployments where a large amount of binary
>content is present, systems tend to share the DataStore to avoid
>duplicating storage. In such cases running Blob GC is a non-trivial
>task, as it involves a manual step and coordination across multiple
>deployments. Due to this, systems tend to run GC less frequently.
>
>Now with Oak, apart from the application, the Oak system itself
>*actively* uses the DataStore to store the Lucene index files, and
>there the churn can be much higher, i.e. index files are created and
>deleted a lot more frequently. This accelerates the rate of garbage
>generation and thus puts a lot more pressure on the DataStore storage
>requirements.
>
>Any thoughts on how to avoid/reduce the requirement to increase the
>frequency of Blob GC?
>
>One possible way would be to provide a special cleanup tool which
>looks for such old Lucene index files and deletes them directly,
>without going through the full-fledged mark-and-sweep logic.
>
>Thoughts?
>
>Chetan Mehrotra
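
For reference, a rough sketch of the age-based sweep such a cleanup
tool implies (purely illustrative plain Java; a real tool would have to
identify index files via repository references rather than timestamps,
and, as noted above, deleting by age alone risks removing binaries that
are still referenced):

    import java.io.IOException;
    import java.nio.file.FileVisitResult;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.SimpleFileVisitor;
    import java.nio.file.attribute.BasicFileAttributes;
    import java.time.Duration;
    import java.time.Instant;

    // Purely illustrative age-based sweep: deletes files under a root
    // directory that have not been modified within the retention window.
    public final class AgeBasedSweep {

        public static void main(String[] args) throws IOException {
            Path root = Paths.get(args[0]); // directory to sweep
            Duration retention = Duration.ofDays(Long.parseLong(args[1]));
            Instant cutoff = Instant.now().minus(retention);

            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file,
                        BasicFileAttributes attrs) throws IOException {
                    // Only files untouched since the cutoff are candidates.
                    if (attrs.lastModifiedTime().toInstant().isBefore(cutoff)) {
                        System.out.println("deleting " + file);
                        Files.delete(file);
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        }
    }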
