[jira] [Commented] (HDDS-8377) Do not count containers with all unhealthy replicas as missing in Recon

Devesh Kumar Singh (Jira) Thu, 06 Apr 2023 07:54:05 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-8377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17709419#comment-17709419
 ]


Devesh Kumar Singh commented on HDDS-8377:
------------------------------------------

[~erose] - I double-checked how Recon reports missing containers. It never 
reports a container as MISSING unless the replica count for that container is 
exactly ZERO. If all replicas of a container are merely UNHEALTHY, Recon will 
NOT report that container as MISSING. I also tried to reproduce this by marking 
the replicas on all 3 DNs as UNHEALTHY, but the container was reported as 
UNDER_REPLICATED and not MISSING.

!image-2023-04-06-20-19-45-628.png|width=823,height=188!

 

Recon reports a container as MISSING based on the following condition in the 
code:

!image-2023-04-06-20-20-56-472.png|width=279,height=98!

and this "numReplicas" is 0 only when the replica set for the given container 
is empty:

!image-2023-04-06-20-21-25-423.png|width=834,height=242!
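
As a rough illustration of the behaviour described above (this is not the 
actual Recon code), the classification can be sketched as follows. The class, 
enum, and method names are hypothetical, and the rule that a container with 
fewer healthy replicas than its replication factor is UNDER_REPLICATED is an 
assumption made only so that the example matches the observed result.

```java
import java.util.List;

// Hypothetical sketch; none of these names come from the Ozone code base.
class ContainerHealthSketch {
    enum ReplicaState { CLOSED, UNHEALTHY }
    enum Health { HEALTHY, UNDER_REPLICATED, MISSING }

    // MISSING only when no replica is known at all. Replicas that exist but
    // are UNHEALTHY still count as known replicas, so the container is not
    // MISSING in that case.
    static Health classify(List<ReplicaState> replicas, int replicationFactor) {
        if (replicas.isEmpty()) {
            return Health.MISSING;
        }
        // Assumption: only non-UNHEALTHY replicas count towards the factor.
        long healthy = replicas.stream()
                .filter(state -> state != ReplicaState.UNHEALTHY)
                .count();
        return healthy < replicationFactor
                ? Health.UNDER_REPLICATED
                : Health.HEALTHY;
    }
}
```

With three UNHEALTHY replicas and a replication factor of 3, this sketch 
yields UNDER_REPLICATED; only an empty replica list yields MISSING.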

 

Now, IMO, in the BofA scenario this can happen if the UNHEALTHY replicas were 
not included in the container reports from all of those DNs. Our container 
report handler would then delete those replicas from the container state 
manager cache, leaving ZERO replicas for that container.
If the DNs start reporting replicas for that container again in the next 
container report heartbeat, the container is shown as UNDER_REPLICATED and NOT 
MISSING. DNs may delay container reports when they are busy. So at the time 
you ran {{ozone admin container info}}, it is possible that the replicas were 
in the UNHEALTHY state but the container reports from those DNs had not yet 
been received by SCM/Recon, and that our SCM/Recon cache had removed those 
replicas during that container report heartbeat.
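
The sequence suspected above can be sketched with a toy replica cache. This 
is a minimal model under stated assumptions, not the real container report 
handler: the class and method names are hypothetical, and it assumes that a 
full report from a DN replaces everything the cache previously knew about 
that DN.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the SCM/Recon container state manager.
class ReplicaCacheSketch {
    // containerId -> datanodes currently known to hold a replica
    private final Map<Long, Set<String>> replicas = new HashMap<>();

    // Apply a full container report from one datanode: replicas the datanode
    // no longer lists are dropped, and the listed ones are (re)added.
    void applyFullReport(String datanode, Set<Long> reportedContainers) {
        for (Set<String> holders : replicas.values()) {
            holders.remove(datanode);
        }
        for (long id : reportedContainers) {
            replicas.computeIfAbsent(id, k -> new HashSet<>()).add(datanode);
        }
    }

    // Number of replicas the cache currently knows for a container.
    int replicaCount(long containerId) {
        Set<String> holders = replicas.get(containerId);
        return holders == null ? 0 : holders.size();
    }
}
```

In this model, if every DN sends one report that omits the container, the 
count drops to zero and the container would look MISSING. Once a later report 
lists the replica again, the count is non-zero and the container would be 
classified by replica health instead.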

> Do not count containers with all unhealthy replicas as missing in Recon
> -----------------------------------------------------------------------
>
>                 Key: HDDS-8377
>                 URL: https://issues.apache.org/jira/browse/HDDS-8377
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone Recon
>            Reporter: Ethan Rose
>            Assignee: Devesh Kumar Singh
>            Priority: Major
>         Attachments: image-2023-04-06-20-19-45-628.png, 
> image-2023-04-06-20-20-56-472.png, image-2023-04-06-20-21-25-423.png
>
>
> Currently, if all replicas of a container are unhealthy, Recon will consider 
> that container missing. When all replicas are unhealthy, some or all of the 
> data in those containers could still be readable. This is different from all 
> replicas being lost, where all the data is definitely not readable. This jira 
> proposes adding a new category to Recon for containers where all replicas are 
> unhealthy, and to not reuse the missing category for these containers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
