[ 
https://issues.apache.org/jira/browse/OAK-11444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17924046#comment-17924046
 ] 

Horia Poradici commented on OAK-11444:
--------------------------------------

From looking at the code, I see that some documents in the SETTINGS collection 
are used in Oak by:
 * the Sweep2 DocumentNodeStore background thread, for status and lock 
management: it reads and updates the sweep2 status and locks (see 
Sweep2StatusDocument.java)
 * the VersionGarbageCollector, which stores and reads timestamps while fullGC 
is running (the versionGC document)

We can try saving the information about deleted documents and properties under 
SETTINGS/bin (the path is just a suggestion), but we should test that the 
performance of the sweep2 background thread is not affected.

One suggestion for performance testing is to create a lot of garbage using the 
oak-run tool and then run fullGC with various settings for batch size and 
delayFactor.
Example command for generating garbage of type GAP_ORPHANS:



java -jar oak-run.jar create-test-garbage "[mongoURI]" create 
--garbageNodesCount 10000 --garbageType 2 --garbageNodesParentCount 1000 
--generateGarbageBatchDelaySeconds 1 --generateGarbageBatchSize 1000
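
To repeat the garbage generation with different batch sizes, the command above 
could be wrapped in a small dry-run helper. This is only a sketch: it reuses 
the flags shown above unchanged, while the MONGO_URI variable, the helper name 
build_garbage_cmd and the list of batch sizes are placeholders picked for 
illustration. It prints the commands instead of running them, and the fullGC 
run itself is left out.

```shell
#!/bin/sh
# Dry-run helper: prints one create-test-garbage command per batch size.
# MONGO_URI and the batch sizes below are illustrative placeholders.
MONGO_URI="${MONGO_URI:-[mongoURI]}"

build_garbage_cmd() {
    # $1 = total number of garbage nodes, $2 = generation batch size
    printf 'java -jar oak-run.jar create-test-garbage "%s" create --garbageNodesCount %s --garbageType 2 --garbageNodesParentCount 1000 --generateGarbageBatchDelaySeconds 1 --generateGarbageBatchSize %s\n' \
        "$MONGO_URI" "$1" "$2"
}

# Print one command per batch size; review the output before running any of it.
for batch_size in 100 1000 5000; do
    build_garbage_cmd 10000 "$batch_size"
done
```

Each printed line can then be run by hand, followed by a fullGC run with the 
settings under test.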

> [full-gc] Save document id and empty properties names before deletion 
> ----------------------------------------------------------------------
>
>                 Key: OAK-11444
>                 URL: https://issues.apache.org/jira/browse/OAK-11444
>             Project: Jackrabbit Oak
>          Issue Type: Story
>          Components: mongomk
>            Reporter: Daniel Iancu
>            Priority: Major
>
> Store the document ID and empty property names in a dedicated *_bin* 
> collection before physical deletion from the Mongo nodes collection during 
> full GC.
> The motivation behind this change is that, if data that should not have been 
> deleted (not garbage) is deleted by accident, this `log` of removed 
> documents and properties will help with complete restoration from backup.
> A separate collection was preferred over logging to files because it is 
> more reliable. Logs usually need to be exported to a platform like Splunk, 
> and that process does not guarantee that all logs are saved. 
> The data saved in the *_bin* collection is temporary; the cleanup can be done 
> by setting a document TTL or by using an external job to remove it. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
