keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r972356614
##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +225,46 @@ private long removeBlipCandidates(GarbageCollectionEnvironment gce,
return blipCount;
}
+ @VisibleForTesting
+ /**
+ * Double check no tables were missed during GC
+ */
+ protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, Set<TableId> tableIdsSeen,
+ Set<TableId> tableIdsAfter) {
+
+ // if a table was added or deleted during this run, it is acceptable to not
+ // have seen those tables ids when scanning the metadata table. So get the intersection
+ final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+ tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+ if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+ throw new RuntimeException("Garbage collection will not proceed because "
+ + "table ids were seen in the metadata table and none were seen Zookeeper. "
+ + "This can have two causes. First, total number of tables going to/from "
+ + "zero during a GC cycle will cause this. Second, it could be caused by "
+ + "corruption of the metadata table and/or Zookeeper. Only the second cause "
+ + "is problematic, but there is no way to distinguish between the two causes "
+ + "so this GC cycle will not proceed. The first cause should be transient "
+ + "and one would not expect to see this message repeated in subsequent GC cycles.");
+ }
+
+ // From that intersection, remove all the table ids that were seen.
+ tableIdsMustHaveSeen.removeAll(tableIdsSeen);
+
+ // If anything is left then we missed a table and may not have removed rfiles references
+ // from the candidates list that are acutally still in use, which would
+ // result in the rfiles being deleted in the next step of the GC process
+ if (!tableIdsMustHaveSeen.isEmpty()) {
+ log.error("TableIDs before: " + tableIdsBefore);
+ log.error("TableIDs after : " + tableIdsAfter);
+ log.error("TableIDs seen : " + tableIdsSeen);
+ log.error("TableIDs that should have been seen but were not: " + tableIdsMustHaveSeen);
+ // maybe a scan failed?
+ throw new RuntimeException(
+ "Saw table IDs in ZK that were not in metadata table: " + tableIdsMustHaveSeen);
+ }
Review Comment:
> The canonical determination of whether a table exists or not is that it
> has an entry in ZK... this is created before metadata entries, and is the last
> thing removed when a table is deleted.

Good catch, we need to consider table states to avoid this race condition.
I mentioned table states in #1377, but it's been so long that I had completely
forgotten about that edge case, and I did not reread the issue until now.
When a table is created, the following is done:
1. The table is put in ZK with TableState.NEW
2. The metadata table is populated
3. The table state is set to TableState.ONLINE or TableState.OFFLINE

When a table is deleted, the following is done:
1. The table state is set to TableState.DELETING
2. Entries are removed from the metadata table
3. Entries are removed from ZK
So from the perspective of GC, if we see a table with a state of
TableState.ONLINE or TableState.OFFLINE before and after scanning the metadata
table, then it must be seen in the metadata table unless there is a problem.
We need to get a `Map<TableId,TableState>` to properly do this check.
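
A rough sketch of what that state-aware check could look like, to make the idea concrete. This is not the PR's code: `TableStateCheck` and its method names are hypothetical, `TableId` is modeled as a plain `String`, and `TableState` is a local stand-in enum that covers only the four states mentioned above.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TableStateCheck {

  // Local stand-in for Accumulo's TableState (assumption: only these states matter here).
  enum TableState { NEW, ONLINE, OFFLINE, DELETING }

  // A table is expected to have metadata entries only once creation has finished
  // and before deletion has started. A null state means the table was not in ZK.
  private static boolean isStable(TableState state) {
    return state == TableState.ONLINE || state == TableState.OFFLINE;
  }

  // Table ids that were ONLINE or OFFLINE both before and after the metadata scan.
  static Set<String> tableIdsMustHaveSeen(Map<String,TableState> before,
      Map<String,TableState> after) {
    Set<String> mustHaveSeen = new TreeSet<>();
    for (Map.Entry<String,TableState> entry : before.entrySet()) {
      if (isStable(entry.getValue()) && isStable(after.get(entry.getKey()))) {
        mustHaveSeen.add(entry.getKey());
      }
    }
    return mustHaveSeen;
  }

  // Table ids that should have been seen in the metadata table but were not.
  // A non-empty result means the GC cycle should not proceed.
  static Set<String> missingTableIds(Map<String,TableState> before,
      Map<String,TableState> after, Set<String> seen) {
    Set<String> missing = tableIdsMustHaveSeen(before, after);
    missing.removeAll(seen);
    return missing;
  }
}
```

With this shape, a table that is still NEW before the scan, or already DELETING (or gone) after it, is never required to appear in the metadata table, which is the race described above.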