mjwall opened a new issue #2322:
URL: https://github.com/apache/accumulo/issues/2322


   **Describe the bug**
   Similar to the issue described in #1916, the garbage collector did not 
drop deletion candidates that were still in use, so rfiles that were still 
referenced were removed from HDFS.
   
   **Versions (OS, Maven, Java, and others, as appropriate):**
    - Affected version(s) of this project: 1.9.4
    - OS: CentOS 7
    - Others:
   
   **To Reproduce**
   Unable to reproduce
   
   **Additional context**
   Unlike #1377, where the consistency checks were insufficient to catch the 
problem, an investigation into this issue shows that the consistency check 
should have caught it because the prior end rows didn't align.
   
   All files removed were hosted on 1 metadata split, indicating that the scan 
of that metadata tablet failed.  Looking at where that metadata tablet was 
hosted at the time of the failure, I found an automated report in the ticket 
system that shows an SSD failure on that box.  HDFS was using SSDs on that 
node.  The admins got the ticket and did a hot swap of the SSD drives without 
shutting down services.  The GC logs were gone by the time I got to the node, 
but what I believe happened is this.
   
   1. Scans were progressing through TabletIterator.  The scan on the node 
hosting this tablet failed, and the 
[TabletIterator](https://github.com/apache/accumulo/blob/1.10/server/base/src/main/java/org/apache/accumulo/server/util/TabletIterator.java#L146)
 logged a metadata inconsistency then 
[reset](https://github.com/apache/accumulo/blob/1.10/server/base/src/main/java/org/apache/accumulo/server/util/TabletIterator.java#L151)
 the scanner.
   2. The reset set the range back to start at the [missing 
tablet](https://github.com/apache/accumulo/blob/1.10/server/base/src/main/java/org/apache/accumulo/server/util/TabletIterator.java#L263)
 and scanned again.  The box was still in a funky state so HDFS again didn't 
return data.  The TabletIterator found 0 entries again for the tablet and 
[threw a 
TabletDeletedException](https://github.com/apache/accumulo/blob/1.10/server/base/src/main/java/org/apache/accumulo/server/util/TabletIterator.java#L268).
  
   3. The TabletDeletedException is just a RuntimeException and the 
TabletIterator continues scanning without stopping the entire GC process.
   
   This is the only way I can explain what happened, as the consistency checks 
should have caught this issue.  I would like to suggest that if a 
TabletDeletedException is thrown during the TabletIterator scan, we cancel the 
entire GC cycle and start again.
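
   The suggestion above could look roughly like the following. This is a minimal, self-contained sketch and not the actual Accumulo code: `GcCycleSketch`, `confirmDeletes`, and the nested `TabletDeletedException` are hypothetical stand-ins, and the metadata scan is modelled as a plain `Iterator<String>` of in-use file names. The point it illustrates is that a failed scan leaves the in-use set incomplete, so the cycle should return nothing to delete instead of continuing.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch; none of these names are taken from the Accumulo code base.
final class GcCycleSketch {

  // Stand-in for the exception named in the report; unchecked, as described there.
  static class TabletDeletedException extends RuntimeException {
    TabletDeletedException(String msg) {
      super(msg);
    }
  }

  /**
   * Returns the candidates that are safe to delete, or Optional.empty() when the
   * scan of in-use files failed and the whole cycle should be abandoned and retried.
   */
  static Optional<List<String>> confirmDeletes(List<String> candidates,
      Iterator<String> inUseFiles) {
    Set<String> inUse = new HashSet<>();
    try {
      while (inUseFiles.hasNext()) {
        inUse.add(inUseFiles.next());
      }
    } catch (TabletDeletedException e) {
      // The scan is incomplete, so the in-use set may be missing references.
      // Delete nothing in this cycle; the caller can start a fresh one later.
      return Optional.empty();
    }

    // Only reached when the scan completed: keep candidates with no reference.
    List<String> safe = new ArrayList<>();
    for (String candidate : candidates) {
      if (!inUse.contains(candidate)) {
        safe.add(candidate);
      }
    }
    return Optional.of(safe);
  }
}
```

   A caller would treat an empty result as "skip all deletes and run the cycle again", which matches the behavior proposed here.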
   
   

