stefan-egli commented on code in PR #993:
URL: https://github.com/apache/jackrabbit-oak/pull/993#discussion_r1274942619


##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGCSupport.java:
##########
@@ -160,6 +203,29 @@ public long getOldestDeletedOnceTimestamp(Clock clock, 
long precisionMs) {
         return ts;
     }
 
+    /**
+     * Retrieve the oldest modified document.
+     *
+     * @return the oldest modified document.
+     */
+    public Optional<NodeDocument> getOldestModifiedDoc(final Clock clock) {
+        long ts = 0;
+        long now = clock.getTime();
+        Iterable<NodeDocument> docs = null;
+
+        LOG.info("find oldest modified document");

Review Comment:
   Noticed in several places in the PR that there are `LOG.info` statements 
(some replacing `debug`). I do agree we should have a reasonable level of 
detail logging by default, so `info` is definitely good in some cases. But I'm 
not sure whether some of these `info` are not leftovers or are indeed here on 
purpose. This one for example doesn't log any parameters, so at least that 
should be added. But it also might be somewhat noisy?



##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/mongo/MongoVersionGCSupport.java:
##########
@@ -120,6 +131,40 @@ public CloseableIterable<NodeDocument> 
getPossiblyDeletedDocs(final long fromMod
                 input -> store.convertFromDBObject(NODES, input)));
     }
 
+    /**
+     * Returns documents that have a {@link NodeDocument#MODIFIED_IN_SECS} 
value
+     * within the given range and are greater than given @{@link 
NodeDocument#ID}.
+     * <p>
+     * The two passed modified timestamps are in milliseconds
+     * since the epoch and the implementation will convert them to seconds at
+     * the granularity of the {@link NodeDocument#MODIFIED_IN_SECS} field and
+     * then perform the comparison.
+     * <p/>
+     *
+     * @param fromModified the lower bound modified timestamp (inclusive)
+     * @param toModified   the upper bound modified timestamp (exclusive)

Review Comment:
   Noticed that we're not going into details as to whether this is seconds or 
milliseconds in other places of this class neither. But with new code, I was 
wondering whether we shouldn't just state here that it is milliseconds. Makes 
it a bit more explicit and reader friendly.



##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -521,6 +614,121 @@ private VersionGCStats gc(long maxRevisionAgeInMillis) 
throws IOException {
             return stats;
         }
 
+        /**
+         * "Detail garbage" refers to additional garbage identified as part of 
OAK-10199
+         * et al: essentially garbage that in earlier versions of Oak were 
ignored. This
+         * includes: deleted properties, revision information within 
documents, branch
+         * commit related garbage.
+         * <p/>
+         * TODO: limit this to run only on a singleton instance, eg the 
cluster leader
+         * <p/>
+         * The "detail garbage" collector can be instructed to do a full 
repository scan
+         * - or incrementally based on where it last left off. When doing a 
full
+         * repository scan (but not limited to that), it executes in (small) 
batches
+         * followed by voluntary paused (aka throttling) to avoid excessive 
load on the
+         * system. The full repository scan does not have to finish 
particularly fast,
+         * it is okay that it takes a considerable amount of time.
+         *
+         * @param phases {@link GCPhases}
+         * @param headRevision the current head revision of node store
+         * @param rec {@link VersionGCRecommendations} to recommend GC 
operation
+         */
+        private void collectDetailedGarbage(final GCPhases phases, final 
RevisionVector headRevision, final VersionGCRecommendations rec)
+                throws IOException {
+            int docsTraversed = 0;
+            boolean foundDoc = true;
+            final long oldestModifiedMs = rec.scopeDetailedGC.fromMs;
+            final long toModified = rec.scopeDetailedGC.toMs;
+            long oldModifiedMs = oldestModifiedMs;
+            final String oldestModifiedDocId = rec.detailedGCId;

Review Comment:
   I think moving all the `final` variables here to the top of the method might 
make it a bit more readable..



##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGCRecommendations.java:
##########
@@ -73,35 +89,57 @@ public class VersionGCRecommendations {
      * @param vgc VersionGC support class
      * @param options options for running the gc
      * @param gcMonitor monitor class for messages
+     * @param detailedGCEnabled whether detailedGC is enabled or not
      */
     public VersionGCRecommendations(long maxRevisionAgeMs, Checkpoints 
checkpoints, Clock clock, VersionGCSupport vgc,
-            VersionGCOptions options, GCMonitor gcMonitor) {
+                                    VersionGCOptions options, GCMonitor 
gcMonitor, final boolean detailedGCEnabled) {

Review Comment:
   One general aspect wrt VersionGCRecommendations came to mind while reviewing 
this : to some extend we now introduce a second GC - while it is embedded in 
the classic GC, the second, detail GC is separated in that it has its own 
from/to timestamps and can be at different positions in the GC process, and 
parts of the code are therefore separate (like the DetailedGC class). On the 
other hand, this VersionGCRecommendations contains both "GC lanes" - so is not 
separated.
   
   From that, I was wondering whether we should separate the recommendation 
aspect between the two as well, so we'd leave the old class as is and introduce 
a detailGC variant of it. That way, we'd not risk modifying anything in the 
classic GC recommendation part - but it also might make code 
reading/maintenance easier.
   
   Wdyt?



##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -521,6 +614,121 @@ private VersionGCStats gc(long maxRevisionAgeInMillis) 
throws IOException {
             return stats;
         }
 
+        /**
+         * "Detail garbage" refers to additional garbage identified as part of 
OAK-10199
+         * et al: essentially garbage that in earlier versions of Oak were 
ignored. This
+         * includes: deleted properties, revision information within 
documents, branch
+         * commit related garbage.
+         * <p/>
+         * TODO: limit this to run only on a singleton instance, eg the 
cluster leader
+         * <p/>
+         * The "detail garbage" collector can be instructed to do a full 
repository scan
+         * - or incrementally based on where it last left off. When doing a 
full
+         * repository scan (but not limited to that), it executes in (small) 
batches
+         * followed by voluntary paused (aka throttling) to avoid excessive 
load on the
+         * system. The full repository scan does not have to finish 
particularly fast,
+         * it is okay that it takes a considerable amount of time.
+         *
+         * @param phases {@link GCPhases}
+         * @param headRevision the current head revision of node store
+         * @param rec {@link VersionGCRecommendations} to recommend GC 
operation
+         */
+        private void collectDetailedGarbage(final GCPhases phases, final 
RevisionVector headRevision, final VersionGCRecommendations rec)
+                throws IOException {
+            int docsTraversed = 0;
+            boolean foundDoc = true;
+            final long oldestModifiedMs = rec.scopeDetailedGC.fromMs;
+            final long toModified = rec.scopeDetailedGC.toMs;
+            long oldModifiedMs = oldestModifiedMs;
+            final String oldestModifiedDocId = rec.detailedGCId;
+            try (DetailedGC gc = new DetailedGC(headRevision, monitor, 
cancel)) {
+                long fromModified = oldestModifiedMs;
+                String fromId = 
ofNullable(oldestModifiedDocId).orElse(MIN_ID_VALUE);
+                NodeDocument lastDoc;
+                if (phases.start(GCPhase.DETAILED_GC)) {
+                    while (foundDoc && fromModified < toModified && 
docsTraversed < PROGRESS_BATCH_SIZE) {
+                        // set foundDoc to false to allow exiting the while 
loop
+                        foundDoc = false;
+                        lastDoc = null;
+                        Iterable<NodeDocument> itr = 
versionStore.getModifiedDocs(fromModified, toModified, 1000, fromId);
+                        try {
+                            for (NodeDocument doc : itr) {
+                                foundDoc = true;
+                                // continue with GC?
+                                if (cancel.get()) {
+                                    foundDoc = false; // to exit while loop as 
well
+                                    log.info("Received GC cancel call. 
Terminating the GC Operation.");
+                                    break;
+                                }
+                                docsTraversed++;
+                                if (docsTraversed % 100 == 0) {
+                                    monitor.info("Iterated through {} 
documents so far. {} had detail garbage",
+                                            docsTraversed, 
gc.getGarbageDocsCount());
+                                }
+
+                                lastDoc = doc;
+                                // collect the data to delete in next step
+                                if (phases.start(GCPhase.COLLECTING)) {
+                                    gc.collectGarbage(doc, phases);
+                                    phases.stop(GCPhase.COLLECTING);
+                                }
+
+                                final Long modified = lastDoc.getModified();
+                                if (modified == null) {
+                                    monitor.warn("collectDetailGarbage : 
document has no _modified property : {}",
+                                            doc.getId());
+                                } else if (SECONDS.toMillis(modified) < 
oldestModifiedMs) {

Review Comment:
   Is this correct? Shouldn't it be `fromModified` here instead of 
`oldestModifiedMs`?



##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/util/Utils.java:
##########
@@ -894,6 +904,17 @@ public static boolean isThrottlingEnabled(final 
DocumentNodeStoreBuilder<?> buil
         return builder.isThrottlingEnabled() || (docStoreThrottlingFeature != 
null && docStoreThrottlingFeature.isEnabled());
     }
 
+    /**
+     * Check whether detailed GC is enabled or not for document store.
+     *
+     * @param builder instance for DocumentNodeStoreBuilder
+     * @return true if detailed GC is enabled else false
+     */
+    public static boolean isDetailedGCEnabled(final 
DocumentNodeStoreBuilder<?> builder) {
+        final Feature docStoreDetailedGCFeature = 
builder.getDocStoreDetailedGCFeature();
+        return builder.isDetailedGCEnabled() || (docStoreDetailedGCFeature != 
null && docStoreDetailedGCFeature.isEnabled());

Review Comment:
   Noticed that for throttling it was done this way as well. But still, I'm 
wondering if the behaviour should be AND, nor OR. I think it would be useful if 
disabling the feature toggle would forcefully disable the detail GC feature - 
but IIUC that's currently not the case, as it could still be enabled via 
config? 



##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -521,6 +614,121 @@ private VersionGCStats gc(long maxRevisionAgeInMillis) 
throws IOException {
             return stats;
         }
 
+        /**
+         * "Detail garbage" refers to additional garbage identified as part of 
OAK-10199
+         * et al: essentially garbage that in earlier versions of Oak were 
ignored. This
+         * includes: deleted properties, revision information within 
documents, branch
+         * commit related garbage.
+         * <p/>
+         * TODO: limit this to run only on a singleton instance, eg the 
cluster leader
+         * <p/>
+         * The "detail garbage" collector can be instructed to do a full 
repository scan
+         * - or incrementally based on where it last left off. When doing a 
full
+         * repository scan (but not limited to that), it executes in (small) 
batches
+         * followed by voluntary paused (aka throttling) to avoid excessive 
load on the
+         * system. The full repository scan does not have to finish 
particularly fast,
+         * it is okay that it takes a considerable amount of time.
+         *
+         * @param phases {@link GCPhases}
+         * @param headRevision the current head revision of node store
+         * @param rec {@link VersionGCRecommendations} to recommend GC 
operation
+         */
+        private void collectDetailedGarbage(final GCPhases phases, final 
RevisionVector headRevision, final VersionGCRecommendations rec)
+                throws IOException {
+            int docsTraversed = 0;
+            boolean foundDoc = true;
+            final long oldestModifiedMs = rec.scopeDetailedGC.fromMs;
+            final long toModified = rec.scopeDetailedGC.toMs;
+            long oldModifiedMs = oldestModifiedMs;
+            final String oldestModifiedDocId = rec.detailedGCId;
+            try (DetailedGC gc = new DetailedGC(headRevision, monitor, 
cancel)) {
+                long fromModified = oldestModifiedMs;
+                String fromId = 
ofNullable(oldestModifiedDocId).orElse(MIN_ID_VALUE);
+                NodeDocument lastDoc;
+                if (phases.start(GCPhase.DETAILED_GC)) {
+                    while (foundDoc && fromModified < toModified && 
docsTraversed < PROGRESS_BATCH_SIZE) {
+                        // set foundDoc to false to allow exiting the while 
loop
+                        foundDoc = false;
+                        lastDoc = null;
+                        Iterable<NodeDocument> itr = 
versionStore.getModifiedDocs(fromModified, toModified, 1000, fromId);
+                        try {
+                            for (NodeDocument doc : itr) {
+                                foundDoc = true;
+                                // continue with GC?
+                                if (cancel.get()) {
+                                    foundDoc = false; // to exit while loop as 
well
+                                    log.info("Received GC cancel call. 
Terminating the GC Operation.");
+                                    break;
+                                }
+                                docsTraversed++;
+                                if (docsTraversed % 100 == 0) {
+                                    monitor.info("Iterated through {} 
documents so far. {} had detail garbage",
+                                            docsTraversed, 
gc.getGarbageDocsCount());

Review Comment:
   It looks like the two parameters here are not in sync : the `docsTraversed` 
is the total number of documents traversed in this entire G run - while 
`getGarbageDocsCount` is only covering 1 iteration (inner loop). 



##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/NodeDocument.java:
##########
@@ -1669,6 +1670,19 @@ SortedMap<Revision, String> getLocalMap(String key) {
         return map;
     }
 
+    /**
+     * Returns name of all the properties on this document

Review Comment:
   Should we add a note in the javadoc here that these names are escaped?



##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/mongo/MongoVersionGCSupport.java:
##########
@@ -193,6 +238,31 @@ public void apply(BasicDBObject document) {
         return result.get(0);
     }
 
+    /**
+     * Retrieve the time of the oldest modified document.
+     *
+     * @param clock System Clock to measure time in accuracy of millis
+     * @return the timestamp of the oldest modified document.
+     */
+    @Override
+    public Optional<NodeDocument> getOldestModifiedDoc(final Clock clock) {
+        LOG.info("getOldestModifiedDoc() <- start");

Review Comment:
   (see other comment, left-over `info` statement? maybe `debug` or `trace` 
instead?)



##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/rdb/RDBDocumentStore.java:
##########
@@ -971,8 +971,8 @@ private <T extends Document> void 
toMapBuilder(ImmutableMap.Builder<String, Stri
     public static String VERSIONPROP = "__version";
 
     // set of supported indexed properties
-    private static final Set<String> INDEXEDPROPERTIES = new 
HashSet<String>(Arrays.asList(new String[] { MODIFIED,
-            NodeDocument.HAS_BINARY_FLAG, NodeDocument.DELETED_ONCE, 
NodeDocument.SD_TYPE, NodeDocument.SD_MAX_REV_TIME_IN_SECS, VERSIONPROP }));
+    private static final Set<String> INDEXEDPROPERTIES = new 
HashSet<>(Arrays.asList(MODIFIED,

Review Comment:
   Might be worth while if @reschke could comment on RDB changes.



##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -521,6 +614,121 @@ private VersionGCStats gc(long maxRevisionAgeInMillis) 
throws IOException {
             return stats;
         }
 
+        /**
+         * "Detail garbage" refers to additional garbage identified as part of 
OAK-10199
+         * et al: essentially garbage that in earlier versions of Oak were 
ignored. This
+         * includes: deleted properties, revision information within 
documents, branch
+         * commit related garbage.
+         * <p/>
+         * TODO: limit this to run only on a singleton instance, eg the 
cluster leader
+         * <p/>
+         * The "detail garbage" collector can be instructed to do a full 
repository scan
+         * - or incrementally based on where it last left off. When doing a 
full
+         * repository scan (but not limited to that), it executes in (small) 
batches
+         * followed by voluntary paused (aka throttling) to avoid excessive 
load on the
+         * system. The full repository scan does not have to finish 
particularly fast,
+         * it is okay that it takes a considerable amount of time.
+         *
+         * @param phases {@link GCPhases}
+         * @param headRevision the current head revision of node store
+         * @param rec {@link VersionGCRecommendations} to recommend GC 
operation
+         */
+        private void collectDetailedGarbage(final GCPhases phases, final 
RevisionVector headRevision, final VersionGCRecommendations rec)
+                throws IOException {
+            int docsTraversed = 0;
+            boolean foundDoc = true;
+            final long oldestModifiedMs = rec.scopeDetailedGC.fromMs;
+            final long toModified = rec.scopeDetailedGC.toMs;
+            long oldModifiedMs = oldestModifiedMs;
+            final String oldestModifiedDocId = rec.detailedGCId;
+            try (DetailedGC gc = new DetailedGC(headRevision, monitor, 
cancel)) {
+                long fromModified = oldestModifiedMs;
+                String fromId = 
ofNullable(oldestModifiedDocId).orElse(MIN_ID_VALUE);
+                NodeDocument lastDoc;
+                if (phases.start(GCPhase.DETAILED_GC)) {
+                    while (foundDoc && fromModified < toModified && 
docsTraversed < PROGRESS_BATCH_SIZE) {
+                        // set foundDoc to false to allow exiting the while 
loop
+                        foundDoc = false;
+                        lastDoc = null;
+                        Iterable<NodeDocument> itr = 
versionStore.getModifiedDocs(fromModified, toModified, 1000, fromId);

Review Comment:
   The `1000` here should probably be replaced with a const, I guess with 
`PROGRESS_BATCH_SIZE`. Which leads to a second aspect : I think there should be 
a test case for this that verifies this aspect. Currently `PROGRESS_BATCH_SIZE` 
is set to `10000`, but here `1000` was used - in which case I think the loop 
condition would have to be broken in a way. But presumably that was not the 
case (assuming the tests work).



##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -356,6 +421,11 @@ private enum GCPhase {
         DELETING,
         SORTING,
         SPLITS_CLEANUP,
+        DETAILED_GC,
+        COLLECT_PROPS,
+        COLLECT_OLD_REVS,
+        COLLECT_UNMERGED_BC,

Review Comment:
   Was wondering if the three new ones should get a detail gc prefix, to more 
easily distinguish them from classic?



##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGCRecommendations.java:
##########
@@ -209,28 +249,75 @@ public void evaluate(VersionGCStats stats) {
             }
             stats.needRepeat = !scopeIsComplete;
         }
+
+        // save data for detailed GC
+        if (detailedGCEnabled && !stats.canceled && 
!stats.ignoredDetailGCDueToCheckPoint) {
+            // success, we would not expect to encounter revisions older than 
this in the future
+            setLongSetting(SETTINGS_COLLECTION_DETAILED_GC_TIMESTAMP_PROP, 
stats.oldestModifiedDocTimeStamp);
+            setStringSetting(SETTINGS_COLLECTION_DETAILED_GC_DOCUMENT_ID_PROP, 
stats.oldestModifiedDocId);
+        }
     }
 
-    private Map<String, Long> getLongSettings() {
-        Document versionGCDoc = 
vgc.getDocumentStore().find(Collection.SETTINGS, 
VersionGarbageCollector.SETTINGS_COLLECTION_ID, 0);
-        Map<String, Long> settings = Maps.newHashMap();
+    private Map<String, Object> getVGCSettings() {
+        Document versionGCDoc = 
vgc.getDocumentStore().find(Collection.SETTINGS, SETTINGS_COLLECTION_ID, 0);
+        Map<String, Object> settings = new HashMap<>();
         // default values
-        
settings.put(VersionGarbageCollector.SETTINGS_COLLECTION_OLDEST_TIMESTAMP_PROP, 
0L);
-        
settings.put(VersionGarbageCollector.SETTINGS_COLLECTION_REC_INTERVAL_PROP, 0L);
+        settings.put(SETTINGS_COLLECTION_OLDEST_TIMESTAMP_PROP, 0L);
+        settings.put(SETTINGS_COLLECTION_REC_INTERVAL_PROP, 0L);
+        settings.put(SETTINGS_COLLECTION_DETAILED_GC_TIMESTAMP_PROP, 0L);
+        settings.put(SETTINGS_COLLECTION_DETAILED_GC_DOCUMENT_ID_PROP, 
MIN_ID_VALUE);
         if (versionGCDoc != null) {
             for (String k : versionGCDoc.keySet()) {
                 Object value = versionGCDoc.get(k);
                 if (value instanceof Number) {
                     settings.put(k, ((Number) value).longValue());
                 }
+                if (value instanceof String) {
+                    settings.put(k, value);
+                }
             }
         }
         return settings;
     }
 
     private void setLongSetting(String propName, long val) {
-        UpdateOp updateOp = new 
UpdateOp(VersionGarbageCollector.SETTINGS_COLLECTION_ID, true);
+        setLongSetting(of(propName, val));
+    }
+
+    private void setStringSetting(String propName, String val) {
+        UpdateOp updateOp = new UpdateOp(SETTINGS_COLLECTION_ID, true);
         updateOp.set(propName, val);
         vgc.getDocumentStore().createOrUpdate(Collection.SETTINGS, updateOp);
     }
+
+    private void setLongSetting(final Map<String, Long> propValMap) {
+        UpdateOp updateOp = new UpdateOp(SETTINGS_COLLECTION_ID, true);
+        propValMap.forEach(updateOp::set);
+        vgc.getDocumentStore().createOrUpdate(Collection.SETTINGS, updateOp);
+    }
+
+    @NotNull
+    private static GCResult getResult(VersionGCOptions options, boolean 
ignoreGC, TimeInterval gcScope, Revision checkpoint) {

Review Comment:
   Looks like `ignoreGC` is always `false` as parameter. Is it really needed?



##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -507,6 +581,25 @@ private VersionGCStats gc(long maxRevisionAgeInMillis) 
throws IOException {
                     collectDeletedDocuments(phases, headRevision, rec);
                     collectSplitDocuments(phases, sweepRevisions, rec);
                 }
+
+                // now run detailed GC if enabled
+                if (detailedGCEnabled) {
+                    if (rec.ignoreDetailGCDueToCheckPoint) {
+                        phases.stats.ignoredDetailGCDueToCheckPoint = true;
+                        monitor.skipped("Checkpoint prevented detailed 
revision garbage collection");
+                    } else {
+                        final RevisionVector headRevision = 
nodeStore.getHeadRevision();
+                        monitor.info("Looking at revisions in {} for detailed 
GC", rec.scopeDetailedGC);
+                        collectDetailedGarbage(phases, headRevision, rec);
+                    }
+                }
+
+                if (detailedGCEnabled && rec.ignoreDueToCheckPoint && 
rec.ignoreDetailGCDueToCheckPoint) {
+                    cancel.set(true);
+                } else if (!detailedGCEnabled && rec.ignoreDueToCheckPoint) {

Review Comment:
   This is a bit complicated to read, I think it could be simplified. Also, in 
detail gc case, is the first line really correct, is detail gc only cancelled 
if both "GC lanes" are ignored due to checkpoint?



##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -521,6 +614,121 @@ private VersionGCStats gc(long maxRevisionAgeInMillis) 
throws IOException {
             return stats;
         }
 
+        /**
+         * "Detail garbage" refers to additional garbage identified as part of 
OAK-10199
+         * et al: essentially garbage that in earlier versions of Oak were 
ignored. This
+         * includes: deleted properties, revision information within 
documents, branch
+         * commit related garbage.
+         * <p/>
+         * TODO: limit this to run only on a singleton instance, eg the 
cluster leader
+         * <p/>
+         * The "detail garbage" collector can be instructed to do a full 
repository scan
+         * - or incrementally based on where it last left off. When doing a 
full
+         * repository scan (but not limited to that), it executes in (small) 
batches
+         * followed by voluntary paused (aka throttling) to avoid excessive 
load on the
+         * system. The full repository scan does not have to finish 
particularly fast,
+         * it is okay that it takes a considerable amount of time.
+         *
+         * @param phases {@link GCPhases}
+         * @param headRevision the current head revision of node store
+         * @param rec {@link VersionGCRecommendations} to recommend GC 
operation
+         */
+        private void collectDetailedGarbage(final GCPhases phases, final 
RevisionVector headRevision, final VersionGCRecommendations rec)
+                throws IOException {
+            int docsTraversed = 0;
+            boolean foundDoc = true;
+            final long oldestModifiedMs = rec.scopeDetailedGC.fromMs;
+            final long toModified = rec.scopeDetailedGC.toMs;

Review Comment:
   Made a similar comment elsewhere about being more explicit whether it is 
milliseconds or seconds. I think particularly in this method it might not hurt 
if `toModified` was eg `toModifiedMs`? It is at least otherwise in contrast 
with `oldestModifiedMs`, which has millis in the name..



##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -605,8 +812,157 @@ private void collectDeletedDocuments(GCPhases phases,
                     gc.updateResurrectedDocuments(phases.stats);
                     phases.stop(GCPhase.UPDATING);
                 }
+            }
+        }
+    }
+
+    private class DetailedGC implements Closeable {
+
+        private final RevisionVector headRevision;
+        private final GCMonitor monitor;
+        private final AtomicBoolean cancel;
+        private final Stopwatch timer;
+        private final List<UpdateOp> updateOpList;
+
+        private final Map<String, Integer> deletedPropsCountMap;
+        private int garbageDocsCount;
+
+        public DetailedGC(@NotNull RevisionVector headRevision, @NotNull 
GCMonitor monitor, @NotNull AtomicBoolean cancel) {
+            this.headRevision = requireNonNull(headRevision);
+            this.monitor = monitor;
+            this.cancel = cancel;
+            this.updateOpList = new ArrayList<>();
+            this.deletedPropsCountMap = new HashMap<>();
+            this.timer = Stopwatch.createUnstarted();
+        }
+
+        public void collectGarbage(final NodeDocument doc, final GCPhases 
phases) {
+
+            monitor.info("Collecting Detailed Garbage for doc [{}]", 
doc.getId());
+
+            final UpdateOp op = new UpdateOp(requireNonNull(doc.getId()), 
false);
+            op.equals(MODIFIED_IN_SECS, doc.getModified());
+
+            collectDeletedProperties(doc, phases, op);
+            collectUnmergedBranchCommitDocument(doc, phases, op);
+            collectOldRevisions(doc, phases, op);
+            // only add if there are changes for this doc
+            if (op.hasChanges()) {
+                garbageDocsCount++;
+                monitor.info("Collected [{}] garbage for doc [{}]", 
op.getChanges().size(), doc.getId());
+                updateOpList.add(op);
+            }
+        }
+
+        private boolean hasGarbage() {
+            return garbageDocsCount > 0;
+        }
+
+        private void collectUnmergedBranchCommitDocument(final NodeDocument 
doc, final GCPhases phases, final UpdateOp updateOp) {
+            if (phases.start(GCPhase.COLLECT_UNMERGED_BC)){
+                // TODO add umerged BC collection logic
+                phases.stop(GCPhase.COLLECT_UNMERGED_BC);
+            }
+
+        }
+
+        private void collectDeletedProperties(final NodeDocument doc, final 
GCPhases phases, final UpdateOp updateOp) {
+
+            // get Map of all properties along with their values
+            if (phases.start(GCPhase.COLLECT_PROPS)) {
+                final Set<String> properties = doc.getPropertyNames();
+
+                // find all the properties which can be removed from document.
+                // All the properties whose value is null in head revision are
+                // eligible to be garbage collected.
+
+                final Set<String> retainPropSet = 
ofNullable(doc.getNodeAtRevision(nodeStore, headRevision, null))
+                        .map(DocumentNodeState::getPropertyNames)
+                        .map(p -> 
p.stream().map(Utils::escapePropertyName).collect(toSet()))
+                        .orElse(emptySet());
+
+                final int deletedPropsGCCount = properties.stream()
+                        .filter(p -> !retainPropSet.contains(p))
+                        .mapToInt(x -> {
+                            updateOp.remove(x);
+                            return 1;})
+                        .sum();
+
+                deletedPropsCountMap.put(doc.getId(), deletedPropsGCCount);
+
+                if (log.isDebugEnabled()) {
+                    log.debug("Collected {} deleted properties for document 
{}", deletedPropsGCCount, doc.getId());
+                }
+                phases.stop(GCPhase.COLLECT_PROPS);
+            }
+        }
+
+        private void collectOldRevisions(NodeDocument doc, GCPhases phases, 
UpdateOp updateOp) {
+
+            if (phases.start(GCPhase.COLLECT_OLD_REVS)){
+                // TODO add old rev collection logic
+                phases.stop(GCPhase.COLLECT_OLD_REVS);
+            }
+
+        }
+
+        int getGarbageDocsCount() {
+            return garbageDocsCount;
+        }
+
+        @Override
+        public void close() throws IOException {
+
+        }
+
+        public void removeGarbage(final VersionGCStats stats) {
+
+            if (updateOpList.isEmpty()) {
+                if (log.isDebugEnabled()) {
+                    log.debug("Skipping removal of detailed garbage, cause no 
garbage detected");
+                }
+                return;
+            }
+
+            int updatedDocs;
+
+            monitor.info("Proceeding to update [{}] documents", 
updateOpList.size());
+
+            if (log.isDebugEnabled()) {
+                String collect = 
updateOpList.stream().map(UpdateOp::getId).collect(joining(","));
+                log.debug("Performing batch update of documents with following 
id's [{}]", collect);
+            }
+
+            if (cancel.get()) {
+                log.info("Aborting the removal of detailed garbage since RGC 
had been cancelled");
+                return;
+            }
+
+            timer.reset().start();
+            try {
+                List<NodeDocument> oldDocs = ds.findAndUpdate(NODES, 
updateOpList);
+                int deletedProps = 
oldDocs.stream().filter(Objects::nonNull).mapToInt(d -> 
deletedPropsCountMap.getOrDefault(d.getId(), 0)).sum();
+                updatedDocs = (int) 
oldDocs.stream().filter(Objects::nonNull).count();
+                stats.updatedDetailedGCDocsCount += updatedDocs;
+                stats.deletedPropsGCCount += deletedProps;
+                log.info("Updated [{}] documents, deleted [{}] properties", 
updatedDocs, deletedProps);
+                // now reset delete metadata
+                updateOpList.clear();
+                garbageDocsCount = 0;

Review Comment:
   `deletedPropsCountMap` should also be cleared here I think



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to