mjwall commented on issue #1916: URL: https://github.com/apache/accumulo/issues/1916#issuecomment-868465742
@ivakegg found something in the [TabletIterator](https://github.com/apache/accumulo/blob/rel/1.9.3/server/base/src/main/java/org/apache/accumulo/server/util/TabletIterator.java) that could cause a section of the metadata table to be missed during scanning. Candidates still referenced in that section would then not be removed from the candidateMap when GC is [checking](https://github.com/apache/accumulo/blob/rel/1.9.3/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java#L176), and would therefore still be part of the [candidateMap](https://github.com/apache/accumulo/blob/rel/1.9.3/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java#L311) where references are removed. This hypothesis is consistent with what was seen several times on a large cluster.

Part of working out how this could happen is understanding how the client code handles a scan failure. Stepping through the code, I hit this section in [ScannerIterator](https://github.com/apache/accumulo/blob/rel/1.9.3/core/src/main/java/org/apache/accumulo/core/client/impl/ScannerIterator.java#L93), which appears to swallow the errors. The first group of exceptions is logged at TRACE, but the systems where we have seen this issue log at DEBUG. So if what I am seeing is correct, something as simple as a scan timeout in an unfortunate metadata range in the TabletIterator would not log anything, and the consistency checks would not catch the issue. I am working to reproduce and prove this locally.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
