Re: [PR] HDDS-9645. Recon doesn't exclude out-of-service nodes when checking for healthy containers [ozone]

via GitHub Wed, 20 Dec 2023 03:17:50 -0800


devmadhuu commented on PR #5651:
URL: https://github.com/apache/ozone/pull/5651#issuecomment-1864299847


   > @devmadhuu Here is a summary of the changes in this PR:
   > 
   > Recon and SCM are using the same naming and terminology regarding 
unhealthy containers. The code in Recon was quite old and following a very 
simple process of identifying each kind of unhealthy container, compared to the 
SCM. Recon was ignoring the existence of offline nodes (decommission, 
maintenance) and was also providing a different explanation for some 
terminologies.
   > 
   > The goal of this PR, is to make Recon container page consistent with the 
ReplicationManager `admin container report`. To avoid duplicating the code or 
its logic, I reused the same methods that the ReplicationManager is using for 
identifying under-replication, over-replication etc. With these changes, Recon 
is using the same algorithms as the RM for unhealthy containers. Every change 
in the RM is carried over to Recon as well.
   > 
   > I think it's important to deal with common terminology, in the same 
manner, to be consistent and avoid confusion. I'm referring to the way Recon 
defines healthy and missing containers.
   > 
   >     * healthy
   >       
   >       * RM
   >         
   >         * the container is closed or quasi-closed and all replicas are in 
the same state
   >       * Recon
   >         
   >         * the container isn't over-replicated or under-replicated or 
mis-replicated
   >         * I didn't change the code but I renamed the method and all of its 
references. This is actually referring to proper replication and not what RM 
means by healthy.
   > 
   >     * missing
   >       
   >       * RM
   >         
   >         * 0 replicas, no information or reference at all. An `admin 
container info` call from the CLI, returns empty for all fields. If a container 
has only 1 unhealthy replica and we can get at least some info from it, like 
pipeline ID, then the container isn't missing.
   >       * Recon
   >         
   >         * `ContainerHealthStatus` was getting a list of all the replicas 
and then filtering it to get just the healthy ones. Then checking the healthy 
replica list and if there are no healthy replicas, then the container is 
missing.
   >         * I modified it to use the initial replica list, healthy and 
unhealthy.
   > 
   > 
   > We can revert the healthy and missing changes to reduce the scope of this 
PR.
   > 
   > I've added an integration test, that places some nodes in decommission or 
maintenance and makes a container go missing. The test, compares the RM `admin 
container report` to Recon's response in every step and makes sure that the 
numbers are the same and that RM's sample of containers exist in Recon's list 
of unhealthy containers.
   
   Thanks @xBis7 for your effort on this patch. I think this looks much better 
in current state and regarding EC handling of containers, this can be taken up 
separately.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HDDS-9645. Recon doesn't exclude out-of-service nodes when checking for healthy containers [ozone]

Reply via email to