errose28 opened a new pull request, #4867:
URL: https://github.com/apache/ozone/pull/4867
This PR incorporates changes from #4838 so it has a place to write files
used to check disk health. Leaving this as a draft until that PR is merged,
which should shrink the diff considerably.
## What changes were proposed in this pull request?
The volume scanner/disk checker currently only checks filesystem permissions
and directory existence. It should also write to, sync, and read back from a
file so that the check touches the actual hardware and not just information
in the OS cache.
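
For illustration, the write, sync, read-back cycle described above could look roughly like the sketch below. This is not the code from this PR: the class name `ReadWriteCheckSketch`, the method `checkReadWrite`, and the `disk-check-` file prefix are hypothetical, and only standard `java.nio` calls are used.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch; names are assumptions, not the PR's actual API.
final class ReadWriteCheckSketch {

  /**
   * Writes numBytes random bytes to a temporary file in dir, forces them to
   * the storage device, reads them back, and compares. Throws IOException
   * (preserving the cause) if any step fails. The directory is never created.
   */
  static void checkReadWrite(Path dir, int numBytes) throws IOException {
    byte[] expected = new byte[numBytes];
    ThreadLocalRandom.current().nextBytes(expected);

    // Fails with NoSuchFileException if dir does not exist.
    Path file = Files.createTempFile(dir, "disk-check-", ".tmp");
    try {
      try (FileChannel channel =
               FileChannel.open(file, StandardOpenOption.WRITE)) {
        ByteBuffer buffer = ByteBuffer.wrap(expected);
        while (buffer.hasRemaining()) {
          channel.write(buffer);
        }
        // Sync file content and metadata to the device.
        channel.force(true);
      }

      byte[] actual = Files.readAllBytes(file);
      if (!Arrays.equals(expected, actual)) {
        throw new IOException("Bytes read from " + file
            + " do not match the bytes written");
      }
    } finally {
      Files.deleteIfExists(file);
    }
  }
}
```

A caller would treat a normal return as a passed check and any `IOException` as a failed check whose message can be logged.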
This PR switches from using the `DiskChecker` class from Hadoop to a new
Ozone-specific `DiskCheckUtil` that we have more control over. The Hadoop
implementation was removed for the following reasons:
- It does not reliably preserve the cause of failure. Many operations are
done using methods from java.io instead of java.nio, so booleans are returned
on failure instead of exceptions with error messages.
- Lack of configuration in disk check files
  - The number of iterations performed and the size of the file are not
    configurable.
  - This PR omits the iterations feature, instead relying on consecutive
    volume scans.
- It creates directories if they do not exist.
  - This could mask a missing mountpoint and cause data to be written to the
    OS drive by mistake.
- Bytes written to disk check file are not read back and checked.
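
To make the first point concrete, here is a small standalone comparison of how `java.io.File` and `java.nio.file.Files` report the same failure. The class and method names are hypothetical and exist only for this illustration; the underlying `File.delete()` and `Files.delete(Path)` calls are standard JDK APIs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration; not code from Hadoop or from this PR.
final class FailureCauseSketch {

  /** java.io style: a bare boolean, with no indication of why it failed. */
  static boolean deleteWithJavaIo(Path path) {
    return path.toFile().delete();
  }

  /**
   * java.nio style: throws a specific IOException subclass, for example
   * NoSuchFileException or AccessDeniedException, naming the path involved.
   */
  static void deleteWithJavaNio(Path path) throws IOException {
    Files.delete(path);
  }
}
```

Deleting a nonexistent path returns `false` through the first method and throws `NoSuchFileException` through the second, so only the second leaves something useful to log.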
The following criteria are used to determine if a volume has failed. If
anyone has suggestions for better heuristics please let me know.
- Directory not existing is an immediate failure.
- Inadequate permissions on the directory is an immediate failure.
- Failure in the write, sync, read, check process on 3 consecutive volume
  scans will fail a volume.
  - Consecutive volume scans will be at least 15 minutes apart. This can be
    configured.
  - The size of the disk check file (100 bytes default) can be configured.
  - The number of consecutive failures that constitutes volume failure (3)
    can be configured.
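
As a rough sketch of how the criteria above fit together, assuming a failure counter that resets on any passing scan (which is how "consecutive" is read here), the logic could be modeled like this. `VolumeFailureSketch`, `directoryUsable`, and `recordScan` are hypothetical names, not the PR's classes or configuration keys.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the failure heuristics; names are assumptions.
final class VolumeFailureSketch {
  private final int failureThreshold;
  private int consecutiveFailures;

  VolumeFailureSketch(int failureThreshold) {
    if (failureThreshold < 1) {
      throw new IllegalArgumentException("threshold must be at least 1");
    }
    this.failureThreshold = failureThreshold;
  }

  /**
   * Immediate-failure checks: the directory must already exist and be
   * readable, writable, and executable. Nothing is created here.
   */
  static boolean directoryUsable(Path dir) {
    return Files.isDirectory(dir)
        && Files.isReadable(dir)
        && Files.isWritable(dir)
        && Files.isExecutable(dir);
  }

  /**
   * Records the result of one scan's write, sync, read, check cycle.
   * Returns true once the number of consecutive failed scans reaches the
   * threshold. A passing scan resets the count.
   */
  boolean recordScan(boolean ioCheckPassed) {
    if (ioCheckPassed) {
      consecutiveFailures = 0;
    } else {
      consecutiveFailures++;
    }
    return consecutiveFailures >= failureThreshold;
  }
}
```

With a threshold of 3, the sequence fail, fail, pass, fail, fail does not fail the volume; a third failure in a row does. The 15 minute minimum gap between scans would be enforced by whatever schedules the calls and is not modeled here.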
## What is the link to the Apache JIRA
HDDS-8782
## How was this patch tested?
WIP
- [x] New unit tests for creating and clearing the tmp directory that holds
volume health check files added.
- [x] Existing volume scanner tests
- [ ] Unit tests for new disk health checks.