rishabhdaim commented on code in PR #956:
URL: https://github.com/apache/jackrabbit-oak/pull/956#discussion_r1228502556
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -521,6 +586,87 @@ private VersionGCStats gc(long maxRevisionAgeInMillis)
throws IOException {
return stats;
}
+    /**
+     * "Detail garbage" refers to additional garbage identified as part of OAK-10199
+     * et al: essentially garbage that earlier versions of Oak ignored. This
+     * includes: deleted properties, revision information within documents, and
+     * branch commit related garbage.
+     * <p/>
+     * TODO: limit this to run only on a singleton instance, eg the cluster leader
+     * <p/>
+     * The "detail garbage" collector can be instructed to do a full repository scan
+     * - or to proceed incrementally from where it last left off. When doing a full
+     * repository scan (but not only then), it executes in (small) batches
+     * followed by voluntary pauses (aka throttling) to avoid excessive load on the
+     * system. The full repository scan does not have to finish particularly fast;
+     * it is okay for it to take a considerable amount of time.
+     *
+     * @param phases {@link GCPhases}
+     * @param headRevision the current head revision of the node store
+     */
+    private void collectDetailedGarbage(final GCPhases phases, final RevisionVector headRevision, final VersionGCRecommendations rec)
+            throws IOException {
+        int docsTraversed = 0;
+        boolean foundDoc = true;
+        long oldestModifiedGCed = rec.scopeFullGC.fromMs;
+        try (DetailedGC gc = new DetailedGC(headRevision, monitor, cancel)) {
+            final long fromModified = rec.scopeFullGC.fromMs;
+            final long toModified = rec.scopeFullGC.toMs;
+            if (phases.start(GCPhase.DETAILED_GC)) {
+                while (foundDoc && oldestModifiedGCed < toModified && docsTraversed <= PROGRESS_BATCH_SIZE) {
+                    // set foundDoc to false to allow exiting the while loop
+                    foundDoc = false;
+                    Iterable<NodeDocument> itr = versionStore.getModifiedDocs(oldestModifiedGCed, toModified, 1000);
+                    try {
+                        for (NodeDocument doc : itr) {
Review Comment:
> There is an additional complication: if there are more than 1000 documents with all equal "_modified", the current code wouldn't get past these 1000 documents, as it only uses "_modified" as the starting condition but stops after 1000. So if there are more than 1000, it currently can't deliver the 1001st, 1002nd etc.
I see one more problem here: if none of the 1000 documents has any garbage, the loop will get stuck here indefinitely. I would propose including the `_id` field in the search criteria of `getModifiedDocs` with a `>` condition and sorting the results by `_id`.
We should store the `_id` of the last processed document in the `settings` collection along with `detailedGCTimeStamp`.
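The proposed `(_modified, _id)` cursor can be sketched roughly as below. This is a minimal, self-contained illustration of the keyset-pagination idea, not Oak code: `Doc`, `fetchBatch` and `drainAll` are hypothetical stand-ins for `NodeDocument`, `getModifiedDocs` and the GC loop, and the in-memory filter simulates what the store query would do.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of keyset pagination on (_modified, _id): the cursor advances past the
// last-seen _id, so a batch of >= 1000 equal "_modified" values, or a batch
// containing no garbage at all, can never stall progress.
public class KeysetPaginationSketch {
    // Hypothetical stand-in for NodeDocument (only the fields the cursor needs).
    record Doc(long modified, String id) {}

    // Simulates a getModifiedDocs variant taking a fromId: returns documents
    // strictly after the (fromModified, fromId) cursor, sorted by (modified, id).
    static List<Doc> fetchBatch(List<Doc> all, long fromModified, String fromId,
                                long toModified, int limit) {
        return all.stream()
                .filter(d -> d.modified() < toModified)
                .filter(d -> d.modified() > fromModified
                        || (d.modified() == fromModified && d.id().compareTo(fromId) > 0))
                .sorted(Comparator.comparingLong(Doc::modified).thenComparing(Doc::id))
                .limit(limit)
                .collect(Collectors.toList());
    }

    // Visits every document exactly once, advancing the cursor after each batch.
    // In the real collector the cursor would be persisted in the settings
    // collection alongside detailedGCTimeStamp, as suggested above.
    static List<String> drainAll(List<Doc> all, long toModified, int batchSize) {
        List<String> seen = new ArrayList<>();
        long cursorModified = 0;
        String cursorId = "";
        while (true) {
            List<Doc> batch = fetchBatch(all, cursorModified, cursorId, toModified, batchSize);
            if (batch.isEmpty()) {
                break; // nothing after the cursor: done, never stuck
            }
            for (Doc d : batch) {
                seen.add(d.id()); // a real implementation would collect garbage here
            }
            Doc last = batch.get(batch.size() - 1);
            cursorModified = last.modified(); // checkpoint both parts of the cursor
            cursorId = last.id();
        }
        return seen;
    }

    public static void main(String[] args) {
        // Five documents sharing one _modified value, drained with batch size 2:
        // a _modified-only cursor would loop forever here, this one terminates.
        List<Doc> docs = List.of(new Doc(10, "a"), new Doc(10, "b"),
                new Doc(10, "c"), new Doc(10, "d"), new Doc(10, "e"));
        System.out.println(drainAll(docs, 100, 2)); // prints [a, b, c, d, e]
    }
}
```

The key point of the sketch is the two-part cursor: ties on `_modified` are broken by `_id`, so each fetch is guaranteed to return documents strictly after the previous batch, independent of whether any of them contained garbage.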
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]