mjwall commented on issue #1916:
URL: https://github.com/apache/accumulo/issues/1916#issuecomment-868465742


   @ivakegg found something in the 
[TabletIterator](https://github.com/apache/accumulo/blob/rel/1.9.3/server/base/src/main/java/org/apache/accumulo/server/util/TabletIterator.java)
 that could miss a section of the metadata table during scanning.  This would 
cause those candidates not to be removed from the candidateMap when the GC is 
[checking](https://github.com/apache/accumulo/blob/rel/1.9.3/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java#L176)
 and therefore they would still be part of the 
[candidateMap](https://github.com/apache/accumulo/blob/rel/1.9.3/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java#L311)
 where references are removed.  This hypothesis is consistent with what was 
seen several times on a large cluster.  
   
   Part of working out how this could happen is understanding how the client 
code handles a scan failure.  Stepping through the code, I hit this section in 
[ScannerIterator](https://github.com/apache/accumulo/blob/rel/1.9.3/core/src/main/java/org/apache/accumulo/core/client/impl/ScannerIterator.java#L93)
 which appears to swallow the errors.  The first group of exceptions is logged 
at TRACE, but the systems where we have seen this issue log at DEBUG, so those 
messages would never show up.   
   
   So if what I am seeing is correct, something as simple as a scan timeout in 
an unfortunate metadata range in the TabletIterator would not log anything, and 
the consistency checks would not catch the issue.
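
   To make the suspected failure mode concrete, here is a minimal, 
self-contained sketch.  It is not Accumulo code: `MissedRangeSketch`, 
`sampleMetadata`, `scanReferences`, and `confirmDeletes` are hypothetical 
stand-ins, and the file paths are made up.  It only assumes the behavior 
described above, namely that a scan error is swallowed and that candidates are 
dropped only when a reference to them is actually seen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

public class MissedRangeSketch {

    /** Hypothetical metadata: tablet row -> file referenced by that tablet. */
    static SortedMap<String, String> sampleMetadata() {
        SortedMap<String, String> metadata = new TreeMap<>();
        metadata.put("tablet1", "/t1/F1.rf");
        metadata.put("tablet2", "/t2/F2.rf");
        metadata.put("tablet3", "/t3/F3.rf");
        return metadata;
    }

    /**
     * Stand-in for the metadata scan. If failingRow is non-null, reading that
     * row throws, and the error is swallowed, so its reference is never seen.
     */
    static List<String> scanReferences(SortedMap<String, String> metadata, String failingRow) {
        List<String> seen = new ArrayList<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            try {
                if (entry.getKey().equals(failingRow)) {
                    throw new RuntimeException("simulated scan timeout");
                }
                seen.add(entry.getValue());
            } catch (RuntimeException e) {
                // Swallowed on purpose: imagine this is logged only at TRACE
                // while the system runs at DEBUG, so nothing is visible.
            }
        }
        return seen;
    }

    /** Stand-in for the GC check: drop every candidate that is still referenced. */
    static Set<String> confirmDeletes(Set<String> candidates, List<String> referenced) {
        Set<String> remaining = new TreeSet<>(candidates);
        remaining.removeAll(referenced);
        return remaining;
    }

    public static void main(String[] args) {
        // F2 is still referenced by tablet2; F9 is genuinely unreferenced.
        Set<String> candidates = new TreeSet<>(Arrays.asList("/t2/F2.rf", "/t9/F9.rf"));

        List<String> fullScan = scanReferences(sampleMetadata(), null);
        List<String> gappedScan = scanReferences(sampleMetadata(), "tablet2");

        System.out.println("complete scan, to delete: " + confirmDeletes(candidates, fullScan));
        System.out.println("scan with gap, to delete: " + confirmDeletes(candidates, gappedScan));
    }
}
```

   With a complete scan only the unreferenced file survives as a delete 
candidate.  With the silent gap, the file still referenced by the skipped row 
also survives, and nothing is logged to show that a range was missed.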
   
   Working to reproduce and prove this locally. 



