[ 
https://issues.apache.org/jira/browse/HDDS-6447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ritesh H Shukla reassigned HDDS-6447:
-------------------------------------

    Assignee: Ethan Rose  (was: Hanisha Koneru)

> Refine SCM handling of unhealthy container replicas
> ---------------------------------------------------
>
>                 Key: HDDS-6447
>                 URL: https://issues.apache.org/jira/browse/HDDS-6447
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: SCM
>            Reporter: Hanisha Koneru
>            Assignee: Ethan Rose
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, containers are marked UNHEALTHY by Container Scrubber for one of 
> the following reasons:
>  # If an operation fails on an open/closing container, it is marked 
> unhealthy so that subsequent write transactions also fail.
>  # If Container Scrubber is enabled and ContainerMetadataScanner detects an 
> error during KeyValueContainerCheck#fastCheck().
>  ** Metadata path or Chunks path is not accessible as a directory
>  ** Container checksum verification fails
>  ** On-disk Container Yaml data does not match in-memory container data 
> (ContainerType, ContainerID, Container DBType, Metadata Path)
>  # If Container Scrubber is enabled and ContainerDataScanner (runs only on 
> closed and quasi-closed containers) detects any block with missing or 
> corrupted chunks file.
> If a container in “open” state in SCM is marked unhealthy (in the container 
> report), SCM asks the DNs to close the container. But for a “closing” 
> container with an “unhealthy” replica, SCM leaves the container replica as is.
> Some of the issues with how unhealthy containers are handled:
>  # If ReplicationManager does not find a healthy replica for a container, it 
> does not replicate that container. So if there is only 1 replica of a 
> container and it is unhealthy, SCM will never replicate it and there is 
> potential for data loss if that single replica is lost for any reason (for 
> example: disk failure).
>  # If there is a _Quasi-Closed_ replica and an _Unhealthy_ container, SCM 
> will delete the unhealthy container. In this scenario, SCM should not delete 
> the unhealthy container if it can be recovered, as it is possible that the 
> unhealthy container is ahead of the quasi-closed container.
>  # SCM should be more conservative with deleting unhealthy containers as they 
> could possibly be recovered. This Jira proposes to let SCM replicate an 
> unhealthy container if there is no other replica. Also, if there is only a 
> quasi-closed replica and an unhealthy replica, SCM should not delete the 
> unhealthy replica.
>  # Let’s say there are 3 quasi-closed replicas of a closed container with all 
> of them having bcsId < container bcsId (closed replica is lost and a 
> quasi-closed replica is replicated). _ReplicationManager_ will delete one of 
> these quasi-closed replicas (_handleUnstableContainer_) and then in the 
> next cycle replicate it again, as the container would now be under-replicated 
> (_handleUnderreplicatedContainer_). This will become a loop of 
> replicating and deleting the container replica.
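
The delete/replicate loop described in the last item can be illustrated with a toy model. This is only a sketch under stated assumptions, not Ozone's ReplicationManager code: the `Replica` class, `run_cycle`, `simulate`, the replication factor of 3, and the rule "delete one stale quasi-closed replica whenever the container is fully replicated" are all hypothetical simplifications of the behaviour the description attributes to handleUnstableContainer and handleUnderreplicatedContainer.

```python
from dataclasses import dataclass

REPLICATION_FACTOR = 3  # assumed for illustration


@dataclass
class Replica:
    datanode: str
    state: str   # e.g. "QUASI_CLOSED"
    bcs_id: int  # block commit sequence id of this replica


def run_cycle(replicas, container_bcs_id, new_node):
    """One simplified replication pass; mutates replicas, returns the action."""
    if len(replicas) < REPLICATION_FACTOR:
        # Under-replicated: copy an existing replica, which is itself stale.
        source = replicas[0]
        replicas.append(Replica(new_node, source.state, source.bcs_id))
        return "replicate"
    stale = [r for r in replicas
             if r.state == "QUASI_CLOSED" and r.bcs_id < container_bcs_id]
    if stale:
        # Fully replicated but every replica is behind: one gets deleted.
        replicas.remove(stale[0])
        return "delete"
    return "none"


def simulate(cycles, container_bcs_id=10, replica_bcs_id=7):
    """Start with three stale quasi-closed replicas and run several passes."""
    replicas = [Replica(f"dn{i}", "QUASI_CLOSED", replica_bcs_id)
                for i in range(REPLICATION_FACTOR)]
    actions = []
    for i in range(cycles):
        actions.append(run_cycle(replicas, container_bcs_id, f"dn-new{i}"))
    return actions, replicas


if __name__ == "__main__":
    actions, replicas = simulate(6)
    print(actions)
    print(len(replicas))
```

In this model the actions alternate between deleting and replicating indefinitely, because a freshly copied replica inherits the same stale bcsId as its source and so is never an improvement over the one that was deleted.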



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

