[
https://issues.apache.org/jira/browse/OAK-11444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17924046#comment-17924046
]
Horia Poradici commented on OAK-11444:
--------------------------------------
From looking at the code, I see that some documents in the SETTINGS collection
are used in Oak by:
* Sweep2 DocumentNodeStore background thread status and lock management:
reading and updating the sweep2 status and locks (see Sweep2StatusDocument.java)
* VersionGarbageCollector timestamp storage and reading while fullGC is
running (the versionGC document)
We can try saving deleted documents and property information under SETTINGS/bin
(the path is just a suggestion), but we should test that the performance of the
sweep2 background thread is not affected.
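To make the idea concrete, here is a minimal sketch of what one saved entry could look like. The issue does not define a schema, so the field names (documentId, emptyProperties, deletedAt) and the class BinEntrySketch are hypothetical. The expiry check only imitates the TTL-style cleanup mentioned in the issue description; it does not use the MongoDB driver.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BinEntrySketch {

    // Builds an in-memory representation of one entry.
    // All field names are assumptions, not an agreed Oak schema.
    public static Map<String, Object> toBinEntry(String documentId,
                                                 List<String> emptyProperties,
                                                 long deletedAtMillis) {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("documentId", documentId);
        entry.put("emptyProperties", List.copyOf(emptyProperties));
        entry.put("deletedAt", deletedAtMillis);
        return entry;
    }

    // True once the entry is at least ttlMillis old, which is the condition
    // a TTL setting or an external cleanup job would act on.
    public static boolean isExpired(Map<String, Object> entry,
                                    long nowMillis,
                                    long ttlMillis) {
        long deletedAt = (Long) entry.get("deletedAt");
        return nowMillis - deletedAt >= ttlMillis;
    }

    public static void main(String[] args) {
        // Example document ID and property names are made up for illustration.
        Map<String, Object> entry = toBinEntry(
                "2:/content/foo", List.of("propA", "propB"), System.currentTimeMillis());
        System.out.println(entry);
    }
}
```

Keeping the entry this small (ID, property names, timestamp) should limit the extra write cost per deleted document, which is the part that needs to be measured against the sweep2 thread.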
One suggestion for performance testing is creating a lot of garbage using the
oak-run tool and then running fullGC with various settings for batch sizes and
delayFactor.
Example command for generating garbage of type GAP_ORPHANS:
java -jar oak-run.jar create-test-garbage "[mongoURI]" create
--garbageNodesCount 10000 --garbageType 2 --garbageNodesParentCount 1000
--generateGarbageBatchDelaySeconds 1 --generateGarbageBatchSize 1000
> [full-gc] Save document id and empty properties names before deletion
> ----------------------------------------------------------------------
>
> Key: OAK-11444
> URL: https://issues.apache.org/jira/browse/OAK-11444
> Project: Jackrabbit Oak
> Issue Type: Story
> Components: mongomk
> Reporter: Daniel Iancu
> Priority: Major
>
> Store the document ID and empty property names in a dedicated *_bin*
> collection before physical deletion from the Mongo nodes collection during
> full GC.
> The motivation behind this change is that, if data that should not have been
> deleted (not garbage) is deleted accidentally, this `log` of removed
> documents and properties will help with complete restoration from backup.
> A separate collection was preferred over logging to files because it is
> more reliable. Logs usually need to be exported to a platform like Splunk, and
> that process does not guarantee that all logs are saved.
> The data saved in the *_bin* collection is temporary; cleanup can be done
> by setting a document TTL or by using an external job to remove it.