errose28 opened a new pull request, #3920:
URL: https://github.com/apache/ozone/pull/3920
**Leaving as draft until all tests are passing and consensus is reached on
the proposed design decisions**
## What changes were proposed in this pull request?
This PR continues Hanisha's work from #3258, although it makes some changes
to the rules proposed there.
This PR updates how the replication manager (RM) handles quasi closed and
unhealthy replicas, for Ratis containers only. Currently all unhealthy
containers are deleted. However, an unhealthy container may still hold mostly
good data with only a few corrupted blocks, and we may gain the ability to
recover unhealthy containers in the future. For this reason, this PR proposes
changing the replication manager to abide by the following rules:
- If the container is closed:
  - If all replicas are unhealthy, they should be replicated like healthy
    containers.
    - This means the system should prioritize having 3 copies available,
      and delete extra copies if over replication occurs.
    - The unhealthy replicas to keep should be prioritized on highest
      BCSID.
  - If only some of the replicas are unhealthy:
    - In iteration 1, RM should replicate the healthy replicas and
      ignore the unhealthy replicas.
    - In iteration 2, RM should delete the unhealthy replicas.
- If the container is not yet closed:
  - Open containers remain excluded from RM.
  - If all replicas are unhealthy:
    - 3 replicas should be preserved. These will be chosen based on
      unique BCSID.
  - If there is a mix of unhealthy and quasi closed replicas:
    - In iteration 1, RM should replicate the quasi closed containers so
      that there are 3 replicas, ignoring the unhealthy replicas.
      - Since SCM currently has no way of knowing whether a replica is
        unhealthy due to a few block corruptions or complete container
        corruption/volume loss, unhealthy replicas should not count towards
        the container's durability.
    - In iteration 2, RM should delete the unhealthy containers whose
      BCSIDs are not represented in the healthy replicas.
      - In the future if unhealthy replicas can be recovered, there is
        potential to close these containers.
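
As a rough illustration of the closed-container rules above, here is a minimal
Java sketch. It is not code from this patch: `ClosedContainerRules`, `Replica`,
`Plan`, and `plan` are hypothetical names, and the sketch assumes a replication
factor of 3 and that "iteration 2" means an RM pass that runs once enough
healthy replicas exist. The rules for containers that are not yet closed are
left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical model of the closed-container rules; not Ozone's classes. */
final class ClosedContainerRules {
  enum State { CLOSED, UNHEALTHY }

  static final class Replica {
    final String datanode;
    final State state;
    final long bcsid;

    Replica(String datanode, State state, long bcsid) {
      this.datanode = datanode;
      this.state = state;
      this.bcsid = bcsid;
    }
  }

  /** What one RM pass would do: copies to create and replicas to delete. */
  static final class Plan {
    int copiesToAdd;
    final List<Replica> toDelete = new ArrayList<>();
  }

  // Assumed replication factor for Ratis containers in this sketch.
  static final int REQUIRED = 3;

  static Plan plan(List<Replica> replicas) {
    List<Replica> healthy = new ArrayList<>();
    List<Replica> unhealthy = new ArrayList<>();
    for (Replica r : replicas) {
      (r.state == State.UNHEALTHY ? unhealthy : healthy).add(r);
    }

    Plan plan = new Plan();
    if (healthy.isEmpty()) {
      // All replicas unhealthy: treat them like healthy ones. Aim for
      // REQUIRED copies and, if over replicated, keep the highest BCSIDs.
      unhealthy.sort(
          Comparator.comparingLong((Replica r) -> r.bcsid).reversed());
      plan.copiesToAdd = Math.max(0, REQUIRED - unhealthy.size());
      if (unhealthy.size() > REQUIRED) {
        plan.toDelete.addAll(unhealthy.subList(REQUIRED, unhealthy.size()));
      }
    } else if (healthy.size() < REQUIRED) {
      // Iteration 1: replicate healthy replicas, ignore unhealthy ones.
      plan.copiesToAdd = REQUIRED - healthy.size();
    } else {
      // Iteration 2 (assumed trigger: enough healthy replicas exist):
      // delete the unhealthy replicas. Over replication of healthy
      // replicas is outside the scope of this sketch.
      plan.toDelete.addAll(unhealthy);
    }
    return plan;
  }
}
```

For example, a closed container with one `CLOSED` and two `UNHEALTHY` replicas
yields a plan that adds two copies and deletes nothing. Once three `CLOSED`
replicas exist, a later pass yields a plan that deletes the unhealthy ones.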
## What is the link to the Apache JIRA
HDDS-6447
## How was this patch tested?
Tests were added and updated in `TestLegacyReplicationManager`. To aid in
reviewing, tests in that class were grouped into nested classes based on
functionality. Tests in the `UnstableReplicas` class concern these changes, and
all of them should be reviewed for expected behavior even if they do not show
up in the diff. Tests in other classes were just relocated. Because the diff
for this refactor is quite messy, I would recommend that reviewers of this file
verify that all tests in the original `TestLegacyReplicationManager` are still
present and passing.
## Criteria for removing from draft
- [ ] Consensus on design decisions
- [ ] All tests passing