errose28 opened a new pull request, #6506:
URL: https://github.com/apache/ozone/pull/6506
(WIP, still needs tests)
## What changes were proposed in this pull request?
A lot of boilerplate code to do something very simple:
- Tell SCM to start reconciliation for a container from the CLI.
- Have SCM tell Datanodes to reconcile that container with their peers.
- Datanodes send back a placeholder container data checksum which we can
fill in with reconciliation implementation later.
- SCM updates its replica info based on the container report received after
the Datanodes reconcile.
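The message path above can be pictured with a tiny in-memory simulation. This is only a sketch under stated assumptions: `ScmSketch`, `DatanodeSketch`, `reconcile`, and the `"placeholder"` string are hypothetical names made up for illustration, not the classes or values added in this PR, and the direct method call stands in for the command and container report exchange.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Datanode; not an Ozone class.
final class DatanodeSketch {
  // Assumed filler value; the PR only says strings are used as placeholders.
  static final String PLACEHOLDER_CHECKSUM = "placeholder";
  final String id;

  DatanodeSketch(String id) {
    this.id = id;
  }

  // Pretends to reconcile the container with its peers and returns the
  // data checksum that would travel back to SCM in a container report.
  String reconcile(long containerId, List<String> peerIds) {
    return PLACEHOLDER_CHECKSUM;
  }
}

// Hypothetical stand-in for SCM; not an Ozone class.
final class ScmSketch {
  private final List<DatanodeSketch> replicas;
  // Replica data checksums are held in memory only, keyed by Datanode id.
  private final Map<String, String> replicaChecksums = new LinkedHashMap<>();

  ScmSketch(List<DatanodeSketch> replicas) {
    this.replicas = replicas;
  }

  // Entry point that a CLI request would eventually reach.
  void reconcileContainer(long containerId) {
    for (DatanodeSketch dn : replicas) {
      // Each Datanode is told which peers hold the other replicas.
      List<String> peers = new ArrayList<>();
      for (DatanodeSketch other : replicas) {
        if (other != dn) {
          peers.add(other.id);
        }
      }
      // The returned value models the checksum in the later report.
      replicaChecksums.put(dn.id, dn.reconcile(containerId, peers));
    }
  }

  Map<String, String> getReplicaChecksums() {
    return replicaChecksums;
  }
}
```

In the real system these steps are asynchronous (command queue, heartbeat, container report), so the synchronous loop here only shows who talks to whom.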
I've tried to avoid making any design-related decisions in this PR. It is
intended as a skeleton into which we can plug the reconciliation implementation
for end-to-end testing as we work on it.
### In scope for this change
- Add new `ozone admin container reconcile <container-id>` command.
- New command should be restricted to admins
- Audit logging for new command
- Blocking reconciliation of invalid containers (EC, 1 replica, still open)
- Datanode queue metrics for reconciliation commands
- Datanode and SCM application logs to follow the command as it moves
through the system.
- SCM saves container replicas' data checksums in memory, and they can be
retrieved with `ozone admin container info --json`
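As a rough illustration of the "blocking reconciliation of invalid containers" item, the check could look something like the following. This is a sketch only: `ReplicationKind`, `LifecycleState`, and `ReconcileEligibility` are hypothetical names, the two-state lifecycle is a simplification, and treating zero replicas like one replica is an assumption rather than something stated in this PR.

```java
import java.util.Optional;

// Hypothetical, simplified types for illustration; not Ozone's own classes.
enum ReplicationKind { RATIS, EC }

enum LifecycleState { OPEN, CLOSED }

final class ReconcileEligibility {
  private ReconcileEligibility() {
  }

  // Returns an empty Optional when reconciliation may proceed, otherwise
  // a human-readable reason for rejecting the request.
  static Optional<String> rejectionReason(
      ReplicationKind kind, int replicaCount, LifecycleState state) {
    if (kind == ReplicationKind.EC) {
      return Optional.of("EC containers cannot be reconciled");
    }
    // Assumption: a container with fewer than two replicas has no peers
    // to reconcile against, so 0 is rejected the same way as 1.
    if (replicaCount <= 1) {
      return Optional.of("container has no peer replicas");
    }
    if (state == LifecycleState.OPEN) {
      return Optional.of("container is still open");
    }
    return Optional.empty();
  }
}
```

Returning a reason instead of a boolean makes it easy to surface the same message in the CLI error and in the audit log.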
### Out of scope for this change (but will be handled in later tasks)
- Any actual checksum related implementations
  - Currently strings are used as placeholders just to move filler data
    around for testing.
- Recon integration with container data checksums
  - This includes Recon's `ContainerReplicaHistoryProto`
- Finalized protobuf changes
  - Since the change is going to a feature branch we have the flexibility to
    evolve the protos later.
- Good UX 😄
  - This includes flags for the `reconcile` command, an easy way to track
    reconciliation progress, and reading containers from stdin like other
    `container` subcommands support.
  - These will need some discussion so are probably best done as their own
    set of changes.
## What is the link to the Apache JIRA
HDDS-10372
## How was this patch tested?
Manual testing currently works, except that errors from the CLI are
treated as retriable. This might be a larger problem with error
handling for the SCM admin CLI in general, but I'm still investigating.
The following automated tests need to be added before the PR is moved out of
draft:
- End to end testing in `admincli/container.robot`
  - New command shows up in suggestions
  - admin only
  - invalid inputs
  - e2e tests that checksum makes it back from the datanode to SCM
- Tests for SCM receiving the command
  - Add new `TestReconcileContainerEventHandler` similar to
    `TestCloseContainerEventHandler`
- Tests for Datanodes receiving the command
  - Add new `TestReconcileContainerCommandHandler` similar to
    `TestCloseContainerCommandHandler`
    - Includes tests for `queueCount` and `invocationCount` metrics
  - Test counts for the new command in
    `TestStateContext#testCommandQueueSummary`
- Tests for Datanodes sending the value back in the heartbeat
  - Add checks to `TestHeartbeatEndpoint`
- Tests for SCM receiving the container report
  - Add checks to `TestContainerReportHandler`
  - Add checks to `TestIncrementalContainerReportHandler`
- Check data checksum is present in replica output of
  `TestInfoSubCommand`
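To make the planned `queueCount` and `invocationCount` checks concrete, here is a minimal sketch of the kind of counters such a test would exercise. `ReconcileCommandHandlerSketch` and its methods are hypothetical and single-threaded; the real Datanode command handlers and their metrics plumbing differ.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical handler used only to illustrate the two metrics.
final class ReconcileCommandHandlerSketch {
  // Container ids of commands received but not yet processed.
  private final Queue<Long> pending = new ArrayDeque<>();
  // Number of commands that have been processed so far.
  private final AtomicLong invocationCount = new AtomicLong();

  void enqueue(long containerId) {
    pending.add(containerId);
  }

  // Processes one queued command if there is one; returns whether it did.
  boolean handleNext() {
    Long containerId = pending.poll();
    if (containerId == null) {
      return false;
    }
    invocationCount.incrementAndGet();
    return true;
  }

  int getQueuedCount() {
    return pending.size();
  }

  long getInvocationCount() {
    return invocationCount.get();
  }
}
```

A test along these lines would enqueue a few commands, assert the queued count, process one, and then assert that the queued count dropped while the invocation count rose.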