stefan-egli commented on code in PR #993:
URL: https://github.com/apache/jackrabbit-oak/pull/993#discussion_r1274942619
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGCSupport.java:
##########
@@ -160,6 +203,29 @@ public long getOldestDeletedOnceTimestamp(Clock clock, long precisionMs) {
         return ts;
     }
+    /**
+     * Retrieve the oldest modified document.
+     *
+     * @return the oldest modified document.
+     */
+    public Optional<NodeDocument> getOldestModifiedDoc(final Clock clock) {
+        long ts = 0;
+        long now = clock.getTime();
+        Iterable<NodeDocument> docs = null;
+
+        LOG.info("find oldest modified document");
Review Comment:
Noticed in several places in the PR that there are `LOG.info` statements (some replacing `debug`). I do agree we should have a reasonable level of detail logging by default, so `info` is definitely good in some cases. But I'm not sure whether some of these `info` statements are leftovers or are indeed intentional. This one, for example, doesn't log any parameters, so at the very least those should be added. It might also be somewhat noisy?
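For illustration, something along these lines (rough sketch; `now` is the clock value read a few lines above, and the exact level is of course debatable):

```java
// hypothetical alternative: demote to debug and include some context
LOG.debug("getOldestModifiedDoc() - searching the oldest modified document, clock time [{}]", now);
```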
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/mongo/MongoVersionGCSupport.java:
##########
@@ -120,6 +131,40 @@ public CloseableIterable<NodeDocument> getPossiblyDeletedDocs(final long fromMod
                 input -> store.convertFromDBObject(NODES, input)));
     }
+    /**
+     * Returns documents that have a {@link NodeDocument#MODIFIED_IN_SECS} value
+     * within the given range and are greater than given @{@link NodeDocument#ID}.
+     * <p>
+     * The two passed modified timestamps are in milliseconds
+     * since the epoch and the implementation will convert them to seconds at
+     * the granularity of the {@link NodeDocument#MODIFIED_IN_SECS} field and
+     * then perform the comparison.
+     * <p/>
+     *
+     * @param fromModified the lower bound modified timestamp (inclusive)
+     * @param toModified the upper bound modified timestamp (exclusive)
Review Comment:
Noticed that we're not going into detail as to whether this is seconds or milliseconds in other places of this class either. But since this is new code, I was wondering whether we shouldn't just state here that it is milliseconds. That would make it a bit more explicit and reader-friendly.
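E.g. the two `@param` tags could spell it out, roughly (wording just a suggestion):

```java
 * @param fromModified the lower bound modified timestamp in milliseconds since the epoch (inclusive)
 * @param toModified   the upper bound modified timestamp in milliseconds since the epoch (exclusive)
```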
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -521,6 +614,121 @@ private VersionGCStats gc(long maxRevisionAgeInMillis) throws IOException {
         return stats;
     }
+    /**
+     * "Detail garbage" refers to additional garbage identified as part of OAK-10199
+     * et al: essentially garbage that in earlier versions of Oak were ignored. This
+     * includes: deleted properties, revision information within documents, branch
+     * commit related garbage.
+     * <p/>
+     * TODO: limit this to run only on a singleton instance, eg the cluster leader
+     * <p/>
+     * The "detail garbage" collector can be instructed to do a full repository scan
+     * - or incrementally based on where it last left off. When doing a full
+     * repository scan (but not limited to that), it executes in (small) batches
+     * followed by voluntary paused (aka throttling) to avoid excessive load on the
+     * system. The full repository scan does not have to finish particularly fast,
+     * it is okay that it takes a considerable amount of time.
+     *
+     * @param phases {@link GCPhases}
+     * @param headRevision the current head revision of node store
+     * @param rec {@link VersionGCRecommendations} to recommend GC operation
+     */
+    private void collectDetailedGarbage(final GCPhases phases, final RevisionVector headRevision, final VersionGCRecommendations rec)
+            throws IOException {
+        int docsTraversed = 0;
+        boolean foundDoc = true;
+        final long oldestModifiedMs = rec.scopeDetailedGC.fromMs;
+        final long toModified = rec.scopeDetailedGC.toMs;
+        long oldModifiedMs = oldestModifiedMs;
+        final String oldestModifiedDocId = rec.detailedGCId;
Review Comment:
I think moving all the `final` variables here to the top of the method might make it a bit more readable.
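Roughly (just reordering, no behaviour change intended):

```java
// scope-derived, effectively constant values first ...
final long oldestModifiedMs = rec.scopeDetailedGC.fromMs;
final long toModified = rec.scopeDetailedGC.toMs;
final String oldestModifiedDocId = rec.detailedGCId;
// ... then the mutable loop state
int docsTraversed = 0;
boolean foundDoc = true;
long oldModifiedMs = oldestModifiedMs;
```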
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGCRecommendations.java:
##########
@@ -73,35 +89,57 @@ public class VersionGCRecommendations {
      * @param vgc VersionGC support class
      * @param options options for running the gc
      * @param gcMonitor monitor class for messages
+     * @param detailedGCEnabled whether detailedGC is enabled or not
      */
     public VersionGCRecommendations(long maxRevisionAgeMs, Checkpoints checkpoints, Clock clock, VersionGCSupport vgc,
-                                    VersionGCOptions options, GCMonitor gcMonitor) {
+                                    VersionGCOptions options, GCMonitor gcMonitor, final boolean detailedGCEnabled) {
Review Comment:
One general aspect wrt VersionGCRecommendations came to mind while reviewing this: to some extent we now introduce a second GC. While it is embedded in the classic GC, the second, detail GC is separate in that it has its own from/to timestamps, can be at a different position in the GC process, and parts of the code are therefore separate (like the DetailedGC class). On the other hand, this VersionGCRecommendations contains both "GC lanes", so it is not separated.
From that, I was wondering whether we should separate the recommendation aspect between the two as well, i.e. leave the old class as is and introduce a detail GC variant of it. That way we would not risk modifying anything in the classic GC recommendation part, and it might also make the code easier to read and maintain.
Wdyt?
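To illustrate what I mean (class and field names entirely made up, just to sketch the separation):

```java
// hypothetical shape of a separate recommendation holder for the detail GC lane,
// leaving the existing VersionGCRecommendations untouched
public class DetailedGCRecommendations {

    final TimeInterval scopeDetailedGC; // own from/to window for detail GC
    final String detailedGCId;          // document id where the last scan left off

    DetailedGCRecommendations(TimeInterval scopeDetailedGC, String detailedGCId) {
        this.scopeDetailedGC = scopeDetailedGC;
        this.detailedGCId = detailedGCId;
    }

    // an evaluate(VersionGCStats) counterpart would then only persist the detail GC settings
}
```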
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -521,6 +614,121 @@ private VersionGCStats gc(long maxRevisionAgeInMillis) throws IOException {
         return stats;
     }
+    /**
+     * "Detail garbage" refers to additional garbage identified as part of OAK-10199
+     * et al: essentially garbage that in earlier versions of Oak were ignored. This
+     * includes: deleted properties, revision information within documents, branch
+     * commit related garbage.
+     * <p/>
+     * TODO: limit this to run only on a singleton instance, eg the cluster leader
+     * <p/>
+     * The "detail garbage" collector can be instructed to do a full repository scan
+     * - or incrementally based on where it last left off. When doing a full
+     * repository scan (but not limited to that), it executes in (small) batches
+     * followed by voluntary paused (aka throttling) to avoid excessive load on the
+     * system. The full repository scan does not have to finish particularly fast,
+     * it is okay that it takes a considerable amount of time.
+     *
+     * @param phases {@link GCPhases}
+     * @param headRevision the current head revision of node store
+     * @param rec {@link VersionGCRecommendations} to recommend GC operation
+     */
+    private void collectDetailedGarbage(final GCPhases phases, final RevisionVector headRevision, final VersionGCRecommendations rec)
+            throws IOException {
+        int docsTraversed = 0;
+        boolean foundDoc = true;
+        final long oldestModifiedMs = rec.scopeDetailedGC.fromMs;
+        final long toModified = rec.scopeDetailedGC.toMs;
+        long oldModifiedMs = oldestModifiedMs;
+        final String oldestModifiedDocId = rec.detailedGCId;
+        try (DetailedGC gc = new DetailedGC(headRevision, monitor, cancel)) {
+            long fromModified = oldestModifiedMs;
+            String fromId = ofNullable(oldestModifiedDocId).orElse(MIN_ID_VALUE);
+            NodeDocument lastDoc;
+            if (phases.start(GCPhase.DETAILED_GC)) {
+                while (foundDoc && fromModified < toModified && docsTraversed < PROGRESS_BATCH_SIZE) {
+                    // set foundDoc to false to allow exiting the while loop
+                    foundDoc = false;
+                    lastDoc = null;
+                    Iterable<NodeDocument> itr = versionStore.getModifiedDocs(fromModified, toModified, 1000, fromId);
+                    try {
+                        for (NodeDocument doc : itr) {
+                            foundDoc = true;
+                            // continue with GC?
+                            if (cancel.get()) {
+                                foundDoc = false; // to exit while loop as well
+                                log.info("Received GC cancel call. Terminating the GC Operation.");
+                                break;
+                            }
+                            docsTraversed++;
+                            if (docsTraversed % 100 == 0) {
+                                monitor.info("Iterated through {} documents so far. {} had detail garbage",
+                                        docsTraversed, gc.getGarbageDocsCount());
+                            }
+
+                            lastDoc = doc;
+                            // collect the data to delete in next step
+                            if (phases.start(GCPhase.COLLECTING)) {
+                                gc.collectGarbage(doc, phases);
+                                phases.stop(GCPhase.COLLECTING);
+                            }
+
+                            final Long modified = lastDoc.getModified();
+                            if (modified == null) {
+                                monitor.warn("collectDetailGarbage : document has no _modified property : {}",
+                                        doc.getId());
+                            } else if (SECONDS.toMillis(modified) < oldestModifiedMs) {
Review Comment:
Is this correct? Shouldn't it be `fromModified` here instead of
`oldestModifiedMs`?
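i.e. what I had in mind, roughly:

```java
// compare against the lower bound of the current batch rather than the overall scope
} else if (SECONDS.toMillis(modified) < fromModified) {
```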
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/util/Utils.java:
##########
@@ -894,6 +904,17 @@ public static boolean isThrottlingEnabled(final DocumentNodeStoreBuilder<?> buil
         return builder.isThrottlingEnabled() || (docStoreThrottlingFeature != null && docStoreThrottlingFeature.isEnabled());
     }
+    /**
+     * Check whether detailed GC is enabled or not for document store.
+     *
+     * @param builder instance for DocumentNodeStoreBuilder
+     * @return true if detailed GC is enabled else false
+     */
+    public static boolean isDetailedGCEnabled(final DocumentNodeStoreBuilder<?> builder) {
+        final Feature docStoreDetailedGCFeature = builder.getDocStoreDetailedGCFeature();
+        return builder.isDetailedGCEnabled() || (docStoreDetailedGCFeature != null && docStoreDetailedGCFeature.isEnabled());
Review Comment:
Noticed that for throttling it was done this way as well. But still, I'm wondering if the behaviour should be AND, not OR. I think it would be useful if disabling the feature toggle would forcefully disable the detailed GC feature, but IIUC that's currently not the case, as it could still be enabled via config?
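A sketch of the AND variant (same accessors as in the PR; the point is that the toggle would act as a kill switch):

```java
public static boolean isDetailedGCEnabled(final DocumentNodeStoreBuilder<?> builder) {
    final Feature docStoreDetailedGCFeature = builder.getDocStoreDetailedGCFeature();
    // config must enable it AND the toggle (if present) must not disable it
    return builder.isDetailedGCEnabled()
            && (docStoreDetailedGCFeature == null || docStoreDetailedGCFeature.isEnabled());
}
```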
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -521,6 +614,121 @@ private VersionGCStats gc(long maxRevisionAgeInMillis) throws IOException {
         return stats;
     }
+    /**
+     * "Detail garbage" refers to additional garbage identified as part of OAK-10199
+     * et al: essentially garbage that in earlier versions of Oak were ignored. This
+     * includes: deleted properties, revision information within documents, branch
+     * commit related garbage.
+     * <p/>
+     * TODO: limit this to run only on a singleton instance, eg the cluster leader
+     * <p/>
+     * The "detail garbage" collector can be instructed to do a full repository scan
+     * - or incrementally based on where it last left off. When doing a full
+     * repository scan (but not limited to that), it executes in (small) batches
+     * followed by voluntary paused (aka throttling) to avoid excessive load on the
+     * system. The full repository scan does not have to finish particularly fast,
+     * it is okay that it takes a considerable amount of time.
+     *
+     * @param phases {@link GCPhases}
+     * @param headRevision the current head revision of node store
+     * @param rec {@link VersionGCRecommendations} to recommend GC operation
+     */
+    private void collectDetailedGarbage(final GCPhases phases, final RevisionVector headRevision, final VersionGCRecommendations rec)
+            throws IOException {
+        int docsTraversed = 0;
+        boolean foundDoc = true;
+        final long oldestModifiedMs = rec.scopeDetailedGC.fromMs;
+        final long toModified = rec.scopeDetailedGC.toMs;
+        long oldModifiedMs = oldestModifiedMs;
+        final String oldestModifiedDocId = rec.detailedGCId;
+        try (DetailedGC gc = new DetailedGC(headRevision, monitor, cancel)) {
+            long fromModified = oldestModifiedMs;
+            String fromId = ofNullable(oldestModifiedDocId).orElse(MIN_ID_VALUE);
+            NodeDocument lastDoc;
+            if (phases.start(GCPhase.DETAILED_GC)) {
+                while (foundDoc && fromModified < toModified && docsTraversed < PROGRESS_BATCH_SIZE) {
+                    // set foundDoc to false to allow exiting the while loop
+                    foundDoc = false;
+                    lastDoc = null;
+                    Iterable<NodeDocument> itr = versionStore.getModifiedDocs(fromModified, toModified, 1000, fromId);
+                    try {
+                        for (NodeDocument doc : itr) {
+                            foundDoc = true;
+                            // continue with GC?
+                            if (cancel.get()) {
+                                foundDoc = false; // to exit while loop as well
+                                log.info("Received GC cancel call. Terminating the GC Operation.");
+                                break;
+                            }
+                            docsTraversed++;
+                            if (docsTraversed % 100 == 0) {
+                                monitor.info("Iterated through {} documents so far. {} had detail garbage",
+                                        docsTraversed, gc.getGarbageDocsCount());
Review Comment:
It looks like the two parameters here are not in sync: `docsTraversed` is the total number of documents traversed in this entire GC run, while `getGarbageDocsCount` only covers one iteration (inner loop).
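One option to keep the two numbers consistent would be to log batch-local counts next to the overall total, e.g. (sketch; `docsTraversedInBatch` would be a new counter, reset at the start of each inner-loop batch):

```java
if (docsTraversed % 100 == 0) {
    monitor.info("Iterated through {} documents in this batch ({} overall). {} had detail garbage",
            docsTraversedInBatch, docsTraversed, gc.getGarbageDocsCount());
}
```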
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/NodeDocument.java:
##########
@@ -1669,6 +1670,19 @@ SortedMap<Revision, String> getLocalMap(String key) {
return map;
}
+ /**
+ * Returns name of all the properties on this document
Review Comment:
Should we add a note in the javadoc here that these names are escaped?
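E.g. something along these lines (wording just a suggestion):

```java
/**
 * Returns the names of all properties on this document.
 * <p>
 * Note: the returned names are in their escaped form, as stored in the
 * document (see {@code Utils.escapePropertyName}).
 */
```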
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/mongo/MongoVersionGCSupport.java:
##########
@@ -193,6 +238,31 @@ public void apply(BasicDBObject document) {
return result.get(0);
}
+ /**
+ * Retrieve the time of the oldest modified document.
+ *
+ * @param clock System Clock to measure time in accuracy of millis
+ * @return the timestamp of the oldest modified document.
+ */
+ @Override
+ public Optional<NodeDocument> getOldestModifiedDoc(final Clock clock) {
+ LOG.info("getOldestModifiedDoc() <- start");
Review Comment:
(see other comment, left-over `info` statement? maybe `debug` or `trace`
instead?)
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/rdb/RDBDocumentStore.java:
##########
@@ -971,8 +971,8 @@ private <T extends Document> void toMapBuilder(ImmutableMap.Builder<String, Stri
     public static String VERSIONPROP = "__version";
     // set of supported indexed properties
-    private static final Set<String> INDEXEDPROPERTIES = new HashSet<String>(Arrays.asList(new String[] { MODIFIED,
-            NodeDocument.HAS_BINARY_FLAG, NodeDocument.DELETED_ONCE, NodeDocument.SD_TYPE, NodeDocument.SD_MAX_REV_TIME_IN_SECS, VERSIONPROP }));
+    private static final Set<String> INDEXEDPROPERTIES = new HashSet<>(Arrays.asList(MODIFIED,
Review Comment:
Might be worth while if @reschke could comment on RDB changes.
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -521,6 +614,121 @@ private VersionGCStats gc(long maxRevisionAgeInMillis) throws IOException {
         return stats;
     }
+    /**
+     * "Detail garbage" refers to additional garbage identified as part of OAK-10199
+     * et al: essentially garbage that in earlier versions of Oak were ignored. This
+     * includes: deleted properties, revision information within documents, branch
+     * commit related garbage.
+     * <p/>
+     * TODO: limit this to run only on a singleton instance, eg the cluster leader
+     * <p/>
+     * The "detail garbage" collector can be instructed to do a full repository scan
+     * - or incrementally based on where it last left off. When doing a full
+     * repository scan (but not limited to that), it executes in (small) batches
+     * followed by voluntary paused (aka throttling) to avoid excessive load on the
+     * system. The full repository scan does not have to finish particularly fast,
+     * it is okay that it takes a considerable amount of time.
+     *
+     * @param phases {@link GCPhases}
+     * @param headRevision the current head revision of node store
+     * @param rec {@link VersionGCRecommendations} to recommend GC operation
+     */
+    private void collectDetailedGarbage(final GCPhases phases, final RevisionVector headRevision, final VersionGCRecommendations rec)
+            throws IOException {
+        int docsTraversed = 0;
+        boolean foundDoc = true;
+        final long oldestModifiedMs = rec.scopeDetailedGC.fromMs;
+        final long toModified = rec.scopeDetailedGC.toMs;
+        long oldModifiedMs = oldestModifiedMs;
+        final String oldestModifiedDocId = rec.detailedGCId;
+        try (DetailedGC gc = new DetailedGC(headRevision, monitor, cancel)) {
+            long fromModified = oldestModifiedMs;
+            String fromId = ofNullable(oldestModifiedDocId).orElse(MIN_ID_VALUE);
+            NodeDocument lastDoc;
+            if (phases.start(GCPhase.DETAILED_GC)) {
+                while (foundDoc && fromModified < toModified && docsTraversed < PROGRESS_BATCH_SIZE) {
+                    // set foundDoc to false to allow exiting the while loop
+                    foundDoc = false;
+                    lastDoc = null;
+                    Iterable<NodeDocument> itr = versionStore.getModifiedDocs(fromModified, toModified, 1000, fromId);
Review Comment:
The `1000` here should probably be replaced with a constant, I guess with `PROGRESS_BATCH_SIZE`. Which leads to a second aspect: I think there should be a test case that verifies this. Currently `PROGRESS_BATCH_SIZE` is set to `10000`, but here `1000` was used, in which case I think the loop condition would have to be broken in some way. But presumably that was not the case (assuming the tests work).
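E.g. (constant name just a suggestion):

```java
// hypothetical constant, value as currently hard-coded at the call site
private static final int DETAILED_GC_BATCH_SIZE = 1000;

// ... and at the call site:
Iterable<NodeDocument> itr = versionStore.getModifiedDocs(fromModified, toModified,
        DETAILED_GC_BATCH_SIZE, fromId);
```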
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -356,6 +421,11 @@ private enum GCPhase {
DELETING,
SORTING,
SPLITS_CLEANUP,
+ DETAILED_GC,
+ COLLECT_PROPS,
+ COLLECT_OLD_REVS,
+ COLLECT_UNMERGED_BC,
Review Comment:
Was wondering if the three new COLLECT_* phases should get a detail GC prefix, to more easily distinguish them from the classic ones?
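E.g. (naming only a suggestion):

```java
DETAILED_GC,
DETAILED_GC_COLLECT_PROPS,
DETAILED_GC_COLLECT_OLD_REVS,
DETAILED_GC_COLLECT_UNMERGED_BC,
```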
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGCRecommendations.java:
##########
@@ -209,28 +249,75 @@ public void evaluate(VersionGCStats stats) {
             }
             stats.needRepeat = !scopeIsComplete;
         }
+
+        // save data for detailed GC
+        if (detailedGCEnabled && !stats.canceled && !stats.ignoredDetailGCDueToCheckPoint) {
+            // success, we would not expect to encounter revisions older than this in the future
+            setLongSetting(SETTINGS_COLLECTION_DETAILED_GC_TIMESTAMP_PROP, stats.oldestModifiedDocTimeStamp);
+            setStringSetting(SETTINGS_COLLECTION_DETAILED_GC_DOCUMENT_ID_PROP, stats.oldestModifiedDocId);
+        }
     }
-    private Map<String, Long> getLongSettings() {
-        Document versionGCDoc = vgc.getDocumentStore().find(Collection.SETTINGS, VersionGarbageCollector.SETTINGS_COLLECTION_ID, 0);
-        Map<String, Long> settings = Maps.newHashMap();
+    private Map<String, Object> getVGCSettings() {
+        Document versionGCDoc = vgc.getDocumentStore().find(Collection.SETTINGS, SETTINGS_COLLECTION_ID, 0);
+        Map<String, Object> settings = new HashMap<>();
         // default values
-        settings.put(VersionGarbageCollector.SETTINGS_COLLECTION_OLDEST_TIMESTAMP_PROP, 0L);
-        settings.put(VersionGarbageCollector.SETTINGS_COLLECTION_REC_INTERVAL_PROP, 0L);
+        settings.put(SETTINGS_COLLECTION_OLDEST_TIMESTAMP_PROP, 0L);
+        settings.put(SETTINGS_COLLECTION_REC_INTERVAL_PROP, 0L);
+        settings.put(SETTINGS_COLLECTION_DETAILED_GC_TIMESTAMP_PROP, 0L);
+        settings.put(SETTINGS_COLLECTION_DETAILED_GC_DOCUMENT_ID_PROP, MIN_ID_VALUE);
         if (versionGCDoc != null) {
             for (String k : versionGCDoc.keySet()) {
                 Object value = versionGCDoc.get(k);
                 if (value instanceof Number) {
                     settings.put(k, ((Number) value).longValue());
                 }
+                if (value instanceof String) {
+                    settings.put(k, value);
+                }
             }
         }
         return settings;
     }
     private void setLongSetting(String propName, long val) {
-        UpdateOp updateOp = new UpdateOp(VersionGarbageCollector.SETTINGS_COLLECTION_ID, true);
+        setLongSetting(of(propName, val));
+    }
+
+    private void setStringSetting(String propName, String val) {
+        UpdateOp updateOp = new UpdateOp(SETTINGS_COLLECTION_ID, true);
         updateOp.set(propName, val);
         vgc.getDocumentStore().createOrUpdate(Collection.SETTINGS, updateOp);
     }
+
+    private void setLongSetting(final Map<String, Long> propValMap) {
+        UpdateOp updateOp = new UpdateOp(SETTINGS_COLLECTION_ID, true);
+        propValMap.forEach(updateOp::set);
+        vgc.getDocumentStore().createOrUpdate(Collection.SETTINGS, updateOp);
+    }
+
+    @NotNull
+    private static GCResult getResult(VersionGCOptions options, boolean ignoreGC, TimeInterval gcScope, Revision checkpoint) {
Review Comment:
Looks like `ignoreGC` is always `false` as parameter. Is it really needed?
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -507,6 +581,25 @@ private VersionGCStats gc(long maxRevisionAgeInMillis) throws IOException {
                 collectDeletedDocuments(phases, headRevision, rec);
                 collectSplitDocuments(phases, sweepRevisions, rec);
             }
+
+            // now run detailed GC if enabled
+            if (detailedGCEnabled) {
+                if (rec.ignoreDetailGCDueToCheckPoint) {
+                    phases.stats.ignoredDetailGCDueToCheckPoint = true;
+                    monitor.skipped("Checkpoint prevented detailed revision garbage collection");
+                } else {
+                    final RevisionVector headRevision = nodeStore.getHeadRevision();
+                    monitor.info("Looking at revisions in {} for detailed GC", rec.scopeDetailedGC);
+                    collectDetailedGarbage(phases, headRevision, rec);
+                }
+            }
+
+            if (detailedGCEnabled && rec.ignoreDueToCheckPoint && rec.ignoreDetailGCDueToCheckPoint) {
+                cancel.set(true);
+            } else if (!detailedGCEnabled && rec.ignoreDueToCheckPoint) {
Review Comment:
This is a bit complicated to read, I think it could be simplified. Also, in the detailed GC case, is the first condition really correct: should it only be cancelled if both "GC lanes" are ignored due to a checkpoint?
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -521,6 +614,121 @@ private VersionGCStats gc(long maxRevisionAgeInMillis) throws IOException {
         return stats;
     }
+    /**
+     * "Detail garbage" refers to additional garbage identified as part of OAK-10199
+     * et al: essentially garbage that in earlier versions of Oak were ignored. This
+     * includes: deleted properties, revision information within documents, branch
+     * commit related garbage.
+     * <p/>
+     * TODO: limit this to run only on a singleton instance, eg the cluster leader
+     * <p/>
+     * The "detail garbage" collector can be instructed to do a full repository scan
+     * - or incrementally based on where it last left off. When doing a full
+     * repository scan (but not limited to that), it executes in (small) batches
+     * followed by voluntary paused (aka throttling) to avoid excessive load on the
+     * system. The full repository scan does not have to finish particularly fast,
+     * it is okay that it takes a considerable amount of time.
+     *
+     * @param phases {@link GCPhases}
+     * @param headRevision the current head revision of node store
+     * @param rec {@link VersionGCRecommendations} to recommend GC operation
+     */
+    private void collectDetailedGarbage(final GCPhases phases, final RevisionVector headRevision, final VersionGCRecommendations rec)
+            throws IOException {
+        int docsTraversed = 0;
+        boolean foundDoc = true;
+        final long oldestModifiedMs = rec.scopeDetailedGC.fromMs;
+        final long toModified = rec.scopeDetailedGC.toMs;
Review Comment:
Made a similar comment elsewhere about being more explicit whether it is milliseconds or seconds. I think particularly in this method it might not hurt if `toModified` were e.g. `toModifiedMs`? It is otherwise inconsistent with `oldestModifiedMs`, which has millis in the name.
##########
oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java:
##########
@@ -605,8 +812,157 @@ private void collectDeletedDocuments(GCPhases phases,
                 gc.updateResurrectedDocuments(phases.stats);
                 phases.stop(GCPhase.UPDATING);
             }
+            }
+        }
+    }
+
+    private class DetailedGC implements Closeable {
+
+        private final RevisionVector headRevision;
+        private final GCMonitor monitor;
+        private final AtomicBoolean cancel;
+        private final Stopwatch timer;
+        private final List<UpdateOp> updateOpList;
+
+        private final Map<String, Integer> deletedPropsCountMap;
+        private int garbageDocsCount;
+
+        public DetailedGC(@NotNull RevisionVector headRevision, @NotNull GCMonitor monitor, @NotNull AtomicBoolean cancel) {
+            this.headRevision = requireNonNull(headRevision);
+            this.monitor = monitor;
+            this.cancel = cancel;
+            this.updateOpList = new ArrayList<>();
+            this.deletedPropsCountMap = new HashMap<>();
+            this.timer = Stopwatch.createUnstarted();
+        }
+
+        public void collectGarbage(final NodeDocument doc, final GCPhases phases) {
+
+            monitor.info("Collecting Detailed Garbage for doc [{}]", doc.getId());
+
+            final UpdateOp op = new UpdateOp(requireNonNull(doc.getId()), false);
+            op.equals(MODIFIED_IN_SECS, doc.getModified());
+
+            collectDeletedProperties(doc, phases, op);
+            collectUnmergedBranchCommitDocument(doc, phases, op);
+            collectOldRevisions(doc, phases, op);
+            // only add if there are changes for this doc
+            if (op.hasChanges()) {
+                garbageDocsCount++;
+                monitor.info("Collected [{}] garbage for doc [{}]", op.getChanges().size(), doc.getId());
+                updateOpList.add(op);
+            }
+        }
+
+        private boolean hasGarbage() {
+            return garbageDocsCount > 0;
+        }
+
+        private void collectUnmergedBranchCommitDocument(final NodeDocument doc, final GCPhases phases, final UpdateOp updateOp) {
+            if (phases.start(GCPhase.COLLECT_UNMERGED_BC)){
+                // TODO add umerged BC collection logic
+                phases.stop(GCPhase.COLLECT_UNMERGED_BC);
+            }
+
+        }
+
+        private void collectDeletedProperties(final NodeDocument doc, final GCPhases phases, final UpdateOp updateOp) {
+
+            // get Map of all properties along with their values
+            if (phases.start(GCPhase.COLLECT_PROPS)) {
+                final Set<String> properties = doc.getPropertyNames();
+
+                // find all the properties which can be removed from document.
+                // All the properties whose value is null in head revision are
+                // eligible to be garbage collected.
+
+                final Set<String> retainPropSet = ofNullable(doc.getNodeAtRevision(nodeStore, headRevision, null))
+                        .map(DocumentNodeState::getPropertyNames)
+                        .map(p -> p.stream().map(Utils::escapePropertyName).collect(toSet()))
+                        .orElse(emptySet());
+
+                final int deletedPropsGCCount = properties.stream()
+                        .filter(p -> !retainPropSet.contains(p))
+                        .mapToInt(x -> {
+                            updateOp.remove(x);
+                            return 1;})
+                        .sum();
+
+                deletedPropsCountMap.put(doc.getId(), deletedPropsGCCount);
+
+                if (log.isDebugEnabled()) {
+                    log.debug("Collected {} deleted properties for document {}", deletedPropsGCCount, doc.getId());
+                }
+                phases.stop(GCPhase.COLLECT_PROPS);
+            }
+        }
+
+        private void collectOldRevisions(NodeDocument doc, GCPhases phases, UpdateOp updateOp) {
+
+            if (phases.start(GCPhase.COLLECT_OLD_REVS)){
+                // TODO add old rev collection logic
+                phases.stop(GCPhase.COLLECT_OLD_REVS);
+            }
+
+        }
+
+        int getGarbageDocsCount() {
+            return garbageDocsCount;
+        }
+
+        @Override
+        public void close() throws IOException {
+
+        }
+
+        public void removeGarbage(final VersionGCStats stats) {
+
+            if (updateOpList.isEmpty()) {
+                if (log.isDebugEnabled()) {
+                    log.debug("Skipping removal of detailed garbage, cause no garbage detected");
+                }
+                return;
+            }
+
+            int updatedDocs;
+
+            monitor.info("Proceeding to update [{}] documents", updateOpList.size());
+
+            if (log.isDebugEnabled()) {
+                String collect = updateOpList.stream().map(UpdateOp::getId).collect(joining(","));
+                log.debug("Performing batch update of documents with following id's [{}]", collect);
+            }
+
+            if (cancel.get()) {
+                log.info("Aborting the removal of detailed garbage since RGC had been cancelled");
+                return;
+            }
+
+            timer.reset().start();
+            try {
+                List<NodeDocument> oldDocs = ds.findAndUpdate(NODES, updateOpList);
+                int deletedProps = oldDocs.stream().filter(Objects::nonNull).mapToInt(d -> deletedPropsCountMap.getOrDefault(d.getId(), 0)).sum();
+                updatedDocs = (int) oldDocs.stream().filter(Objects::nonNull).count();
+                stats.updatedDetailedGCDocsCount += updatedDocs;
+                stats.deletedPropsGCCount += deletedProps;
+                log.info("Updated [{}] documents, deleted [{}] properties", updatedDocs, deletedProps);
+                // now reset delete metadata
+                updateOpList.clear();
+                garbageDocsCount = 0;
Review Comment:
`deletedPropsCountMap` should also be cleared here, I think.
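i.e. roughly:

```java
// reset all per-batch bookkeeping together
updateOpList.clear();
deletedPropsCountMap.clear();
garbageDocsCount = 0;
```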
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]