devmadhuu commented on PR #5651: URL: https://github.com/apache/ozone/pull/5651#issuecomment-1864299847
> @devmadhuu Here is a summary of the changes in this PR: > > Recon and SCM are using the same naming and terminology regarding unhealthy containers. The code in Recon was quite old and following a very simple process of identifying each kind of unhealthy container, compared to the SCM. Recon was ignoring the existence of offline nodes (decommission, maintenance) and was also providing a different explanation for some terminologies. > > The goal of this PR, is to make Recon container page consistent with the ReplicationManager `admin container report`. To avoid duplicating the code or its logic, I reused the same methods that the ReplicationManager is using for identifying under-replication, over-replication etc. With these changes, Recon is using the same algorithms as the RM for unhealthy containers. Every change in the RM is carried over to Recon as well. > > I think it's important to deal with common terminology, in the same manner, to be consistent and avoid confusion. I'm referring to the way Recon defines healthy and missing containers. > > * healthy > > * RM > > * the container is closed or quasi-closed and all replicas are in the same state > * Recon > > * the container isn't over-replicated or under-replicated or mis-replicated > * I didn't change the code but I renamed the method and all of its references. This is actually referring to proper replication and not what RM means by healthy. > > * missing > > * RM > > * 0 replicas, no information or reference at all. An `admin container info` call from the CLI, returns empty for all fields. If a container has only 1 unhealthy replica and we can get at least some info from it, like pipeline ID, then the container isn't missing. > * Recon > > * `ContainerHealthStatus` was getting a list of all the replicas and then filtering it to get just the healthy ones. Then checking the healthy replica list and if there are no healthy replicas, then the container is missing. > * I modified it to use the initial replica list, healthy and unhealthy. > > > We can revert the healthy and missing changes to reduce the scope of this PR. > > I've added an integration test, that places some nodes in decommission or maintenance and makes a container go missing. The test, compares the RM `admin container report` to Recon's response in every step and makes sure that the numbers are the same and that RM's sample of containers exist in Recon's list of unhealthy containers. Thanks @xBis7 for your effort on this patch. I think this looks much better in current state and regarding EC handling of containers, this can be taken up separately. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
