Hi,

In addition to, or instead of, using the BlobStore, we could store the
Lucene index on the filesystem (persistence = file, path = ...).
But I would probably only do that on a case-by-case basis. I think it
would reduce, but not solve, the compaction problem.

Some numbers from a test repository I have (not compacted):

* 7 million segments in 3 tar files, of which
* 4.3 million (146 GB) are data segments, and
* 2.7 million (187 GB) are binary segments.

For this case, using external blobs would at most reduce the repository
size by around 60% (so 40%, the 146 GB of data segments, would still
remain). This change might make it possible to compact more efficiently,
but I'm not sure.

Regards,
Thomas

On 04/09/14 13:25, "Chetan Mehrotra" <[email protected]> wrote:

>Hi Team,
>
>Currently SegmentNodeStore does not use a BlobStore by default and
>stores the binary data within data tar files. This has the following
>benefits:
>
>1. Backup is simpler - User just needs to back up the segmentstore directory
>2. No Blob GC - The RevisionGC would also delete the binary content and a
> separate Blob GC need not be performed
>3. Faster IO - The binary content would be fetched via memory mapped files
> and hence might have better performance compared to streamed IO.
>
>However, of late we are seeing an issue where the repository is not able
>to reclaim space from deleted binary content as part of normal cleanup,
>and full scale compaction needs to be performed to reclaim the space.
>However, running compaction has other issues (see OAK-2045) and
>currently it needs to be performed offline to get optimum results.
>
>In quite a few cases it has been seen that repository growth is mostly
>due to Lucene index content changes, which lead to creation of new
>binary content and also cause fragmentation due to newer revisions.
>Further, as the Segment logic does not perform deduplication, any change
>in a Lucene index file would probably recreate the whole index file
>(need to confirm).
>
>Given that such repository growth is troublesome, it might be better if
>we configure a BlobStore by default with SegmentNodeStore (or at least
>for applications like AEM). This should reduce the rate of repository
>growth due to
>
>1. Deduplication - BlobStore and DataStore (current impls) implement
>deduplication, so adding the same binary would not cause size growth
>
>2. Less fragmentation - As large binary content would not be part of
>data tar files, Blob GC would be able to reclaim space. Currently,
>in a cleanup, if even one bulk segment in a data tar is still
>referenced, the cleanup would not be able to remove that tar. That space
>can only be reclaimed via compaction.
>
>Compared to the benefits mentioned initially
>
>1. Backup - User needs to back up two folders
>2. Blob GC would need to be run separately
>3. Faster IO - That needs to be seen. For Lucene this can be mitigated
>to an extent with the proposed CopyOnReadDirectory support in OAK-1724
>
>Further, we also get the benefit of sharing the BlobStore between
>multiple instances if required!!
>
>Thoughts?
>
>Chetan Mehrotra
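
As a quick check on the "around 60%" estimate in the reply, here is a minimal sketch of the arithmetic. It uses only the two sizes given for the test repository (146 GB of data segments, 187 GB of binary segments) and assumes, as a simplification, that moving to external blobs would take all binary segments out of the tar files and leave the data segments unchanged. The variable names are illustrative only.

```python
# Sizes reported for the test repository (in GB).
data_gb = 146    # data segments, would stay in the tar files
binary_gb = 187  # binary segments, candidates for an external BlobStore

total_gb = data_gb + binary_gb

# Assumption: every binary segment moves out, data segments are untouched.
binary_share = binary_gb / total_gb  # upper bound on the size reduction
data_share = data_gb / total_gb      # share that would still remain

print(f"total: {total_gb} GB")
print(f"max reduction: {binary_share:.1%}")
print(f"remaining:     {data_share:.1%}")
```

With these figures the upper bound comes out at roughly 56%, with roughly 44% remaining, which is consistent with the rounded 60% / 40% split mentioned in the reply.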
