xBis7 commented on PR #5651:
URL: https://github.com/apache/ozone/pull/5651#issuecomment-1857669374
@devmadhuu Here is a summary of the changes in this PR:
Recon and SCM are using the same naming and terminology regarding unhealthy
containers. The code in Recon was quite old and following a very simple process
of identifying each kind of unhealthy container, compared to the SCM. Recon was
ignoring the existence of offline nodes (decommission, maintenance) and was
also providing a different explanation for some terminologies.
The goal of this PR, is to make Recon container page consistent with the
ReplicationManager `admin container report`. To avoid duplicating the code or
its logic, I reused the same methods that the ReplicationManager is using for
identifying under-replication, over-replication etc. With these changes, Recon
is using the same algorithms as the RM for unhealthy containers. Every change
in the RM is carried over to Recon as well.
I think it's important to deal with common terminology, in the same manner,
to be consistent and avoid confusion. I'm referring to the way Recon defines
healthy and missing containers.
* healthy
* RM
* the container is closed or quasi-closed and all replicas are in the
same state
* Recon
* the container isn't over-replicated or under-replicated or
mis-replicated
* I didn't change the code but I renamed the method and all of its
references. This is actually referring to proper replication and not what RM
means by healthy.
* missing
* RM
* 0 replicas, no information or reference at all. An `admin container
info` call from the CLI, returns empty for all fields. If a container has only
1 unhealthy replica and we can get at least some info from it, like pipeline
ID, then the container isn't missing.
* Recon
* `ContainerHealthStatus` was getting a list of all the replicas and
then filtering it to get just the healthy ones. Then checking the healthy
replica list and if there are no healthy replicas, then the container is
missing.
* I modified it to use the initial replica list, for checking for
missing.
We can revert the healthy and missing changes to reduce the scope of this PR.
I've added an integration test, that places some nodes in decommission or
maintenance and makes a container go missing. The test, compares the RM `admin
container report` to Recon's response in every step and makes sure that the
numbers are the same and that RM's sample of containers exist in Recon's list
of unhealthy containers.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]