Re: [PR] HDDS-9645. Recon doesn't exclude out-of-service nodes when checking for healthy containers [ozone]

via GitHub Fri, 15 Dec 2023 02:48:11 -0800


xBis7 commented on PR #5651:
URL: https://github.com/apache/ozone/pull/5651#issuecomment-1857669374


   @devmadhuu Here is a summary of the changes in this PR:
   
   Recon and SCM are using the same naming and terminology regarding unhealthy 
containers. The code in Recon was quite old and following a very simple process 
of identifying each kind of unhealthy container, compared to the SCM. Recon was 
ignoring the existence of offline nodes (decommission, maintenance) and was 
also providing a different explanation for some terminologies.
   
   The goal of this PR, is to make Recon container page consistent with the 
ReplicationManager `admin container report`. To avoid duplicating the code or 
its logic, I reused the same methods that the ReplicationManager is using for 
identifying under-replication, over-replication etc. With these changes, Recon 
is using the same algorithms as the RM for unhealthy containers. Every change 
in the RM is carried over to Recon as well. 
   
   I think it's important to deal with common terminology, in the same manner, 
to be consistent and avoid confusion. I'm referring to the way Recon defines 
healthy and missing containers.
   
   * healthy
     * RM
       *  the container is closed or quasi-closed and all replicas are in the 
same state
     * Recon
       *  the container isn't over-replicated or under-replicated or 
mis-replicated 
       * I didn't change the code but I renamed the method and all of its 
references. This is actually referring to proper replication and not what RM 
means by healthy.
   * missing
     * RM
       * 0 replicas, no information or reference at all. An `admin container 
info` call from the CLI, returns empty for all fields. If a container has only 
1 unhealthy replica and we can get at least some info from it, like pipeline 
ID, then the container isn't missing.
     * Recon
       * `ContainerHealthStatus` was getting a list of all the replicas and 
then filtering it to get just the healthy ones. Then checking the healthy 
replica list and if there are no healthy replicas, then the container is 
missing.
       * I modified it to use the initial replica list, for checking for 
missing.
   
   We can revert the healthy and missing changes to reduce the scope of this PR.
   
   I've added an integration test, that places some nodes in decommission or 
maintenance and makes a container go missing. The test, compares the RM `admin 
container report` to Recon's response in every step and makes sure that the 
numbers are the same and that RM's sample of containers exist in Recon's list 
of unhealthy containers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HDDS-9645. Recon doesn't exclude out-of-service nodes when checking for healthy containers [ozone]

Reply via email to