errose28 opened a new pull request, #4867:
URL: https://github.com/apache/ozone/pull/4867

   This PR incorporates changes from #4838 so it has a place to write files 
used to check disk health. Leaving this as a draft until that PR is merged, 
which should shrink the diff considerably.
   
   ## What changes were proposed in this pull request?
   
   The volume scanner/disk checker currently only checks filesystem permissions 
and directory existence. It should also write to a file, sync it, and read it 
back, so that the check touches the actual hardware and not just information in 
the OS cache.
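   
   As a rough illustration of the write, sync, read, check sequence described 
above, here is a minimal sketch. It is not the `DiskCheckUtil` code from this 
PR; the class name `ReadWriteCheckSketch`, the method `checkReadWrite`, and the 
`disk-check-` file prefix are all hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch; not the DiskCheckUtil implementation from this PR. */
final class ReadWriteCheckSketch {
  private ReadWriteCheckSketch() { }

  /**
   * Writes numBytes random bytes to a new file in dir, syncs them, reads
   * them back, and compares. Any failure surfaces as an IOException.
   * The directory is never created here; a missing directory is an error.
   */
  static void checkReadWrite(Path dir, int numBytes) throws IOException {
    byte[] expected = new byte[numBytes];
    ThreadLocalRandom.current().nextBytes(expected);

    Path file = Files.createTempFile(dir, "disk-check-", ".tmp");
    try {
      try (FileChannel channel =
               FileChannel.open(file, StandardOpenOption.WRITE)) {
        ByteBuffer buffer = ByteBuffer.wrap(expected);
        while (buffer.hasRemaining()) {
          channel.write(buffer);
        }
        // Ask the OS to flush file content and metadata to the device.
        channel.force(true);
      }

      byte[] actual = Files.readAllBytes(file);
      if (!Arrays.equals(expected, actual)) {
        throw new IOException("Data read from " + file
            + " does not match the " + numBytes + " bytes written");
      }
    } finally {
      Files.deleteIfExists(file);
    }
  }
}
```

   A caller would invoke something like 
`ReadWriteCheckSketch.checkReadWrite(volumeTmpDir, 100)` once per scan and 
treat any `IOException` as a failed check. Note that the read in this sketch 
may still be served from the page cache; only the `force(true)` call is 
guaranteed to push the write toward the device.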
   
   This PR switches from using the `DiskChecker` class from Hadoop to a new 
Ozone-specific `DiskCheckUtil` that we have more control over. The Hadoop 
implementation was replaced for the following reasons:
   - It does not reliably preserve the cause of failure. Many operations are 
done using methods from java.io instead of java.nio, so booleans are returned 
on failure instead of exceptions with error messages being thrown.
   - Lack of configuration in disk check files
     - The number of iterations performed and the size of the file are not 
configurable.
     - This PR omits the iterations feature, instead relying on consecutive 
volume scans.
   - It creates directories if they do not exist.
     - This could mask a missing mountpoint and cause data to be written to the 
OS drive by mistake.
   - Bytes written to the disk check file are not read back and checked.
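   
   To illustrate the first point in the list above, the sketch below contrasts 
how a java.io call and its java.nio counterpart report the same failure. It 
shows the general pattern only and is not taken from Hadoop's `DiskChecker` or 
from this PR; `FailureCauseSketch` and its methods are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch comparing java.io and java.nio failure reporting. */
final class FailureCauseSketch {
  private FailureCauseSketch() { }

  /** java.io style: every kind of failure collapses into "false". */
  static boolean deleteLegacy(Path path) {
    return path.toFile().delete();
  }

  /**
   * java.nio style: the thrown exception names the cause.
   * Returns null on success, otherwise a description of the failure.
   */
  static String describeDeleteFailure(Path path) {
    try {
      Files.delete(path);
      return null;
    } catch (IOException e) {
      return e.getClass().getSimpleName() + ": " + e.getMessage();
    }
  }
}
```

   For a path that does not exist, `deleteLegacy` only returns `false`, while 
`describeDeleteFailure` returns a string that starts with 
`NoSuchFileException`, which is the kind of detail worth logging when a volume 
check fails.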
   
   The following criteria are used to determine if a volume has failed. If 
anyone has suggestions for better heuristics please let me know.
   - Directory not existing is an immediate failure.
   - Inadequate permissions on the directory is an immediate failure.
   - Failure in the write, sync, read, check process on 3 consecutive volume 
scans will fail a volume.
     - Consecutive volume scans will be at least 15 minutes apart. This can be 
configured.
     - The size of the disk check file (100 bytes default) can be configured.
     - The number of consecutive failures that constitutes volume failure (3) 
can be configured.
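   
   The failure criteria above can be sketched as a small state machine. This 
is an assumption about how the rules fit together, not the code in this PR: 
the class `VolumeFailureSketch`, the `ScanResult` enum, and the choice to reset 
the counter after a healthy scan are all hypothetical.

```java
/** Hypothetical sketch of the volume failure heuristic; not code from this PR. */
final class VolumeFailureSketch {

  /** Possible outcomes of a single volume scan. */
  enum ScanResult {
    HEALTHY, DIRECTORY_MISSING, BAD_PERMISSIONS, IO_CHECK_FAILED
  }

  private final int failureThreshold;
  private int consecutiveIoFailures;

  VolumeFailureSketch(int failureThreshold) {
    if (failureThreshold < 1) {
      throw new IllegalArgumentException("failureThreshold must be >= 1");
    }
    this.failureThreshold = failureThreshold;
  }

  /**
   * Records the result of one scan and reports whether the volume should
   * now be treated as failed.
   */
  boolean recordScan(ScanResult result) {
    switch (result) {
      case DIRECTORY_MISSING:
      case BAD_PERMISSIONS:
        // Immediate failure, regardless of earlier scans.
        return true;
      case IO_CHECK_FAILED:
        // Write, sync, read, check failed: count toward the threshold.
        consecutiveIoFailures++;
        return consecutiveIoFailures >= failureThreshold;
      default:
        // A healthy scan breaks the run of consecutive failures.
        consecutiveIoFailures = 0;
        return false;
    }
  }
}
```

   With the default threshold of 3, two failed IO checks followed by a healthy 
scan leave the volume in service, and three failed IO checks in a row fail it.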
   
   ## What is the link to the Apache JIRA
   
   HDDS-8782
   
   ## How was this patch tested?
   
   WIP
   - [x] Added new unit tests for creating and clearing the tmp directory that 
holds volume health check files.
   - [x] Existing volume scanner tests
   - [ ] Unit tests for new disk health checks.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

