keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r977593844
##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long
removeBlipCandidates(GarbageCollectionEnvironment gce,
return blipCount;
}
+ @VisibleForTesting
+ /**
+ * Double check no tables were missed during GC
+ */
+ protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore,
Set<TableId> tableIdsSeen,
+ Set<TableId> tableIdsAfter) {
+
+ // if a table was added or deleted during this run, it is acceptable to not
+ // have seen those tables ids when scanning the metadata table. So get the
intersection
+ final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+ tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+ if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+ throw new RuntimeException("Garbage collection will not proceed because "
+ + "table ids were seen in the metadata table and none were seen
Zookeeper. "
+ + "This can have two causes. First, total number of tables going
to/from "
+ + "zero during a GC cycle will cause this. Second, it could be
caused by "
+ + "corruption of the metadata table and/or Zookeeper. Only the
second cause "
+ + "is problematic, but there is no way to distinguish between the
two causes "
+ + "so this GC cycle will not proceed. The first cause should be
transient "
+ + "and one would not expect to see this message repeated in
subsequent GC cycles.");
+ }
Review Comment:
> We specifically have the initial GC delay shorter (30s) than the cycle
delay (5m), which makes this more likely to occur if the first table they
create is around the 30 second mark when the first GC cycle is starting to run.
I was thinking through this situation wondering how probable it was and in
the process determined that this particular situation is probably not possible.
For this case there would be no GC candidates. I wasn't sure what the garbage
collector did when there are no candidates. I looked and found the following
code.
https://github.com/apache/accumulo/blob/3e1225f0db469f8fa76be7cc4e72961696b57a2f/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java#L314-L333
Looking at that, if `candidatesIter` does not initially have a next then it
will not scan for references and run the validation added in this PR. I was
happy to find the GC does no unnecessary work when there are no candidates.
A user would need to at least do the following to possibly bump into this
message.
* create a table
* initiate activity that creates gc candidates like major compactions
* delete the table
The delete would have to overlap in time with the GC reading tables ids from
ZK and reading references from the metadata table. For a system with hardly
any tables or tablets reading them will be extremely fast making the
probability of a one off user operation overlapping extremely small. A user
has to delete a table within a 5ms to 10ms time window that happens after 30s
and then every 5 min to see this message. So it seems unlikely someone would
bump into this. It would be good to run through a situation like the above a
few times and see what happens, I will give that a try.
Where this message will likely be seen is on a system that has lots of
tables and tablets (makes GC cycle take longer) and frequently goes to/from
zero total tables.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]