[PR] HDDS-11290. Container scanner should keep scanning after non-fatal errors [ozone]

via GitHub Tue, 27 Aug 2024 16:44:21 -0700


errose28 opened a new pull request, #7127:
URL: https://github.com/apache/ozone/pull/7127

## What changes were proposed in this pull request?

### Motivation

In order to build the merkle tree of all data in the container, the scanner
should not exit after the first issue it encounters like it does currently. The
scanner should track and return all errors that it sees, and only stop the scan
on fatal errors that prevent further scanning of the container, like DB access
errors.

This PR is a prerequisite to HDDS-10374. It does not generate a merkle tree
during the scan and does not test that functionality. It sets up HDDS-10374 to
be an easy drop-in to the scanner, which will allow that change to focus on
testing merkle tree generation.

### Primary Changes

Previously, `ScanResult` was an object that encapsulated a single error: the
first error the scanner saw, which would abort the scan. This change decouples
the `ScanResult` from the errors, which are now represented by a list of
`ContainerScanError`s in the `ScanResult`.
- `ScanResult` is a general interface that can represent a data or metadata
scan. It can be logged by entities like the `ContainerLogger` which do not care
where the unhealthy result came from.
- `MetadataScanResult` is a `ScanResult` implementation produced from a
container metadata scan.
  - This scan will not produce a merkle tree since it does not check data.
- `DataScanResult` extends `MetadataScanResult` by adding a merkle tree
representing the data that was scanned.
  - All data scans begin with a metadata scan, and then proceed to scan the
  data only if the metadata scan succeeds.
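
As a rough illustration of the shape described above, the following is a minimal, self-contained sketch and not the code in this PR. Every name in it (`SketchScanError`, `SketchScanResult`, `SketchMetadataScanResult`, `SketchDataScanResult`, `ScanSketch`) is hypothetical, the prefix-based chunk check is a toy stand-in for real verification, and a plain list of verified chunk IDs stands in for the merkle tree. It assumes that only DB access errors are fatal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch types; names and fields are illustrative, not Ozone's.
final class SketchScanError {
  enum Kind { MISSING_CONTAINER_FILE, MISSING_CHUNK, CORRUPT_CHUNK, DB_ACCESS }

  private final Kind kind;
  private final String detail;

  SketchScanError(Kind kind, String detail) {
    this.kind = kind;
    this.detail = detail;
  }

  Kind getKind() { return kind; }

  // Assumption: only DB access failures prevent further scanning.
  boolean isFatal() { return kind == Kind.DB_ACCESS; }

  @Override
  public String toString() { return kind + "(" + detail + ")"; }
}

// A result is healthy exactly when it carries no errors.
interface SketchScanResult {
  List<SketchScanError> getErrors();

  default boolean isHealthy() { return getErrors().isEmpty(); }
}

class SketchMetadataScanResult implements SketchScanResult {
  private final List<SketchScanError> errors;

  SketchMetadataScanResult(List<SketchScanError> errors) {
    this.errors = Collections.unmodifiableList(new ArrayList<>(errors));
  }

  @Override
  public List<SketchScanError> getErrors() { return errors; }
}

class SketchDataScanResult extends SketchMetadataScanResult {
  // Stand-in for the merkle tree: IDs of the chunks that verified cleanly.
  private final List<String> verifiedChunks;

  SketchDataScanResult(List<SketchScanError> errors, List<String> verified) {
    super(errors);
    this.verifiedChunks = Collections.unmodifiableList(new ArrayList<>(verified));
  }

  List<String> getVerifiedChunks() { return verifiedChunks; }
}

final class ScanSketch {
  private ScanSketch() { }

  // Toy check: the chunk ID prefix decides the outcome; null means no error.
  static SketchScanError checkChunk(String chunk) {
    if (chunk.startsWith("missing")) {
      return new SketchScanError(SketchScanError.Kind.MISSING_CHUNK, chunk);
    } else if (chunk.startsWith("corrupt")) {
      return new SketchScanError(SketchScanError.Kind.CORRUPT_CHUNK, chunk);
    } else if (chunk.startsWith("dbfail")) {
      return new SketchScanError(SketchScanError.Kind.DB_ACCESS, chunk);
    }
    return null;
  }

  // Scans data only if the metadata scan was healthy; collects every
  // non-fatal error and stops at the first fatal one.
  static SketchDataScanResult scanData(SketchMetadataScanResult metadata,
      List<String> chunks) {
    if (!metadata.isHealthy()) {
      return new SketchDataScanResult(metadata.getErrors(),
          Collections.<String>emptyList());
    }
    List<SketchScanError> errors = new ArrayList<>();
    List<String> verified = new ArrayList<>();
    for (String chunk : chunks) {
      SketchScanError error = checkChunk(chunk);
      if (error == null) {
        verified.add(chunk);
        continue;
      }
      errors.add(error);
      if (error.isFatal()) {
        break;
      }
    }
    return new SketchDataScanResult(errors, verified);
  }

  public static void main(String[] args) {
    SketchMetadataScanResult metadata =
        new SketchMetadataScanResult(Collections.<SketchScanError>emptyList());
    SketchDataScanResult result = scanData(metadata,
        Arrays.asList("ok-1", "corrupt-2", "missing-3", "ok-4"));
    System.out.println("healthy=" + result.isHealthy());
    System.out.println("errors=" + result.getErrors());
    System.out.println("verified=" + result.getVerifiedChunks());
  }
}
```

In this sketch a caller such as a logger only needs the `SketchScanResult` view, while a caller that wants the tree stand-in works with `SketchDataScanResult` directly.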

### Secondary Changes

- General cleanup of `KeyValueContainerCheck` internals was done, since this
PR already required invasive changes in this area.
- Fixed a bug from #5485/HDDS-9005 where the DB was being used to check for
container deletion during a scan instead of the container state in memory.
  - This would fail to detect a schema v2 container which is completely lost
  but not actually deleted. The container would remain in the datanode's memory
  as healthy.
  - Since the [memory state is updated
  first](https://github.com/apache/ozone/blob/769a9aad1fe8c936c5dc5d5019ca4ed61644c042/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java#L1512)
  and our container object will still be valid even after being removed from the
  [ContainerSet](https://github.com/apache/ozone/blob/769a9aad1fe8c936c5dc5d5019ca4ed61644c042/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java#L114),
  we can use the `DELETED` state as the deletion check which is both safer and
  simpler.
- When a container is deleted during a scan, this is indicated with a
different field in the `ScanResult`. It is no longer considered an error and
will not cause the `ScanResult` to be unhealthy.
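
The deletion handling described in the last two items can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: `SketchContainer`, `SketchContainerState`, `SketchOutcome`, and `DeletionCheckSketch` are hypothetical names, and a boolean flag stands in for whether the container's DB can be read.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; these are not the Ozone classes.
enum SketchContainerState { CLOSED, DELETED }

final class SketchContainer {
  private volatile SketchContainerState state = SketchContainerState.CLOSED;
  private volatile boolean dbReadable = true;

  SketchContainerState getState() { return state; }

  boolean isDbReadable() { return dbReadable; }

  // Deletion updates the in-memory state first, then removes on-disk data.
  void delete() {
    state = SketchContainerState.DELETED;
    dbReadable = false;
  }

  // Simulates a container whose data is lost without it being deleted.
  void loseData() { dbReadable = false; }
}

final class SketchOutcome {
  private final boolean deleted;
  private final List<String> errors;

  SketchOutcome(boolean deleted, List<String> errors) {
    this.deleted = deleted;
    this.errors = Collections.unmodifiableList(new ArrayList<>(errors));
  }

  // Deletion is reported in its own field and never counts as an error.
  boolean isDeleted() { return deleted; }

  List<String> getErrors() { return errors; }

  boolean isHealthy() { return errors.isEmpty(); }
}

final class DeletionCheckSketch {
  private DeletionCheckSketch() { }

  static SketchOutcome scan(SketchContainer container) {
    List<String> errors = new ArrayList<>();
    if (!container.isDbReadable()) {
      errors.add("container DB is not readable");
    }
    // The deletion check consults the in-memory state, not the DB. Any
    // failures seen while the container was being deleted are discarded.
    if (container.getState() == SketchContainerState.DELETED) {
      return new SketchOutcome(true, Collections.<String>emptyList());
    }
    return new SketchOutcome(false, errors);
  }
}
```

With this arrangement, a container that was deleted mid-scan yields a healthy outcome with the deleted flag set, while a container whose data disappeared without a state change yields an unhealthy outcome.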

### Notes to Reviewers

It is probably best to review the scan flow end-to-end in addition to just
viewing the diff.

## What is the link to the Apache JIRA

HDDS-11290

## How was this patch tested?

The scanner has a lot of existing tests that should all pass, ensuring no
regressions (still WIP):
- `TestKeyValueContainerCheck`
- `TestKeyValueHandlerWithUnhealthyContainer`
- `Test{Background,OnDemand}Container{Data,Metadata}Scanner`
- `Test{Background,OnDemand}Container{Data,Metadata}ScannerIntegration`

New tests added for detecting multiple errors:
- WIP
