Hi,

In addition to, or instead of, using the BlobStore, we could store the
Lucene index in the filesystem (persistence = file, path = ...).

But I would probably only do that on a case-by-case basis. I think it
would reduce, but not solve, the compaction problem. Some numbers from a
test repository I have (not compacted):

* 7 million segments in 3 tar files, of which
* 4.3 million (146 GB) are data segments, and
* 2.7 million (187 GB) are binary segments.

For this case, using external blobs would reduce the repository size by
at most around 56% (187 of 333 GB), so about 44%, the 146 GB of data
segments, would still remain. This change might make compaction more
efficient, but I'm not sure.
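
A quick check of that estimate, as a minimal sketch. It assumes that every
binary segment would move to the external store and that nothing else
changes in size; the variable names are only illustrative.

```python
# Segment sizes in GB, taken from the test repository numbers above.
data_gb = 146    # data segments: these stay in the tar files
binary_gb = 187  # binary segments: candidates for an external BlobStore

total_gb = data_gb + binary_gb

# Upper bound on the reduction: assumes ALL binary segments leave the
# segment store, which is the best case rather than a measured result.
binary_share = binary_gb / total_gb
data_share = data_gb / total_gb

print(f"total size:      {total_gb} GB")
print(f"max reduction:   {binary_share:.0%} ({binary_gb} GB)")
print(f"still remaining: {data_share:.0%} ({data_gb} GB)")
```

The remaining data segments are the part that compaction would still have
to deal with.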

Regards,
Thomas




On 04/09/14 13:25, "Chetan Mehrotra" wrote:

>Hi Team,
>
>Currently SegmentNodeStore does not use a BlobStore by default and
>stores the binary data within the data tar files. This has the
>following advantages:
>
>1. Backup is simpler - The user just needs to back up the segmentstore
>    directory
>2. No Blob GC - The RevisionGC would also delete the binary content, so
>    a separate Blob GC need not be performed
>3. Faster IO - The binary content would be fetched via memory-mapped
>    files and hence might perform better than streamed IO.
>
>However, of late we are seeing issues where the repository is not able
>to reclaim space from deleted binary content as part of normal cleanup,
>and full-scale compaction needs to be performed to reclaim the space.
>However, running compaction has other issues (see OAK-2045), and
>currently it needs to be performed offline to get optimum results.
>
>In quite a few cases it has been seen that repository growth is mostly
>due to Lucene index content changes, which lead to the creation of new
>binary content and also cause fragmentation due to newer revisions.
>Further, as the Segment logic does not perform deduplication, any
>change in a Lucene index file would probably re-create the whole index
>file (need to confirm).
>
>Given that such repository growth is troublesome, it might be better if
>we configure a BlobStore by default with SegmentNodeStore (or at least
>for applications like AEM). This should reduce the rate of repository
>growth due to
>
>1. Deduplication - BlobStore and DataStore (current impls) implement
>deduplication, so adding the same binary again would not cause size
>growth
>
>2. Less fragmentation - As large binary content would not be part of
>the data tar files, Blob GC would be able to reclaim space. Currently,
>if even one bulk segment in a data tar is still referenced, the
>cleanup would not be able to remove it. That space can only be
>reclaimed via compaction.
>
>Compared to the benefits mentioned initially:
>
>1. Backup - The user needs to back up two folders
>2. Blob GC would need to be run separately
>3. Faster IO - That remains to be seen. For Lucene this can be
>mitigated to an extent with the proposed CopyOnReadDirectory support
>in OAK-1724
>
>Further we also get the benefit of sharing the BlobStore between
>multiple instances if required!!
>
>Thoughts?
>
>Chetan Mehrotra
