Francesco Mari created OAK-3536:
-----------------------------------

             Summary: Indexing with Lucene and copy-on-read generate too much 
garbage in the BlobStore
                 Key: OAK-3536
                 URL: https://issues.apache.org/jira/browse/OAK-3536
             Project: Jackrabbit Oak
          Issue Type: Bug
          Components: lucene
    Affects Versions: 1.3.9
            Reporter: Francesco Mari
            Priority: Critical


The copy-on-read strategy used with Lucene indexing performs too many copies of the index files from the filesystem to the repository. Every copy discards the previously stored binary, which sits in the BlobStore as garbage until binary garbage collection kicks in. When the load on the system is particularly intense, this behaviour makes the repository grow at an unreasonably high pace.
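
For illustration, the general pattern is the one sketched below: every time a binary property is overwritten with a fresh copy of an index file, the previously referenced blob becomes unreferenced and lingers in the BlobStore until binary garbage collection runs. This is only a minimal JCR-level sketch, not the actual copy-on-read code path; the path, property name and copyOfIndexFile() helper are hypothetical.

{code:java}
import java.io.ByteArrayInputStream;
import javax.jcr.Binary;
import javax.jcr.Node;
import javax.jcr.Session;

public class OrphanedBlobSketch {

    // Hypothetical stand-in for producing a full copy of an updated index file.
    static byte[] copyOfIndexFile() {
        return new byte[8 * 1024 * 1024];
    }

    public static void overwriteRepeatedly(Session session, int cycles) throws Exception {
        Node holder = session.getNode("/tmp/index-holder"); // illustrative path
        for (int i = 0; i < cycles; i++) {
            Binary copy = session.getValueFactory()
                    .createBinary(new ByteArrayInputStream(copyOfIndexFile()));
            // Overwriting the property dereferences the previous binary; the
            // old blob sits in the BlobStore as garbage until binary GC runs.
            holder.setProperty("indexFile", copy);
            session.save();
        }
    }
}
{code}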

I spotted this on a system where some content is generated every day at a specific time. The content generation process creates approx. 6 million new nodes, where each node contains 5 properties with small, random string values. Nodes are saved in batches of 1000. At the end of the content generation process, the nodes are deleted to deliberately generate garbage in the Segment Store. This is part of a testing effort to assess the efficiency of online compaction.
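
For reference, a minimal JCR sketch of that load follows. The class name and node names are hypothetical, and the flat node hierarchy is a simplification of the actual content structure:

{code:java}
import java.util.UUID;
import javax.jcr.Node;
import javax.jcr.Session;

public class BlobGarbageLoad {

    private static final int TOTAL_NODES = 6_000_000;
    private static final int BATCH_SIZE = 1_000;
    private static final int PROPS_PER_NODE = 5;

    public static void run(Session session) throws Exception {
        Node root = session.getRootNode().addNode("load-test");
        session.save();
        for (int i = 0; i < TOTAL_NODES; i++) {
            Node node = root.addNode("node-" + i);
            for (int p = 0; p < PROPS_PER_NODE; p++) {
                // Small, random string values, as in the original test.
                node.setProperty("prop-" + p, UUID.randomUUID().toString());
            }
            if ((i + 1) % BATCH_SIZE == 0) {
                session.save(); // commit in batches of 1000 nodes
            }
        }
        session.save();
        // Finally remove everything to deliberately produce garbage
        // in the Segment Store.
        root.remove();
        session.save();
    }
}
{code}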

I was never able to complete the tests because the system ran out of disk space due to a large number of unused binary values. When debugging the system, on a full 400 GB disk, the segments containing nodes and property values occupied approx. 3 GB. The rest of the space was occupied by binary values in the form of bulk segments.


