[
https://issues.apache.org/jira/browse/HDDS-6447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ritesh H Shukla reassigned HDDS-6447:
-------------------------------------
Assignee: Ethan Rose (was: Hanisha Koneru)
> Refine SCM handling of unhealthy container replicas
> ---------------------------------------------------
>
> Key: HDDS-6447
> URL: https://issues.apache.org/jira/browse/HDDS-6447
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: SCM
> Reporter: Hanisha Koneru
> Assignee: Ethan Rose
> Priority: Major
> Labels: pull-request-available
>
> Currently, containers are marked UNHEALTHY by Container Scrubber for one of
> the following reasons:
> # If an operation fails on an open/closing container, it is marked
> unhealthy so that subsequent write transactions also fail.
> # If Container Scrubber is enabled and ContainerMetadataScanner detects an
> error during KeyValueContainerCheck#fastCheck().
> ** Metadata path or Chunks path is not accessible as a directory
> ** Container checksum verification fails
> ** On-disk Container Yaml data does not match in-memory container data
> (ContainerType, ContainerID, Container DBType, Metadata Path)
> # If Container Scrubber is enabled and ContainerDataScanner (runs only on
> closed and quasi-closed containers) detects any block with a missing or
> corrupted chunk file.
> If a container in “open” state in SCM is marked unhealthy (in the container
> report), SCM asks the DNs to close the container. But for a “closing”
> container with an “unhealthy” replica, SCM leaves the container replica as is.
> Some of the issues with how unhealthy containers are handled:
> # If ReplicationManager does not find a healthy replica for a container, it
> does not replicate that container. So if there is only 1 replica of a
> container and it is unhealthy, SCM will never replicate it and there is
> potential for data loss if that single replica is lost for any reason (for
> example: disk failure).
> # If there is a _Quasi-Closed_ replica and an _Unhealthy_ replica, SCM
> will delete the unhealthy replica. In this scenario, SCM should not delete
> the unhealthy replica if it can be recovered, as it is possible that the
> unhealthy replica is ahead of the quasi-closed replica.
> # SCM should be more conservative about deleting unhealthy containers as they
> could possibly be recovered. This Jira proposes to let SCM replicate an
> unhealthy container if there is no other replica. Also, if there is only a
> quasi-closed replica and an unhealthy replica, SCM should not delete the
> unhealthy replica.
> # Let’s say there are 3 quasi-closed replicas of a closed container with all
> of them having bcsId < container bcsId (closed replica is lost and a
> quasi-closed replica is replicated). _ReplicationManager_ will delete one of
> these quasi-closed replicas (_handleUnstableContainer_) and then in the
> next cycle replicate it again as the container would now be under-replicated
> (_handleUnderreplicatedContainer_). This becomes a loop of
> replicating and deleting the container replica.
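
The more conservative handling proposed in the description can be sketched roughly as below. This is only an illustration, not Ozone code: the class `UnhealthyReplicaPolicy`, the enums `ReplicaState` and `Action`, and the method `decide` are hypothetical names, and two rules are assumptions not stated in the issue: "no other replica" is read as "every known replica is unhealthy", and deletion is allowed only when at least one CLOSED replica exists.

```java
import java.util.List;

// Hypothetical sketch; none of these names come from the Ozone code base.
final class UnhealthyReplicaPolicy {

    enum ReplicaState { OPEN, CLOSING, QUASI_CLOSED, CLOSED, UNHEALTHY }

    enum Action { NONE, REPLICATE_UNHEALTHY, KEEP_UNHEALTHY, DELETE_UNHEALTHY }

    // Decides what to do with the unhealthy replica(s) of one container,
    // given the states of all replicas currently reported for it.
    static Action decide(List<ReplicaState> replicas) {
        if (!replicas.contains(ReplicaState.UNHEALTHY)) {
            return Action.NONE;
        }
        // Proposal: replicate an unhealthy replica when there is no other
        // replica (assumed here to mean: all replicas are unhealthy).
        boolean onlyUnhealthy =
            replicas.stream().allMatch(s -> s == ReplicaState.UNHEALTHY);
        if (onlyUnhealthy) {
            return Action.REPLICATE_UNHEALTHY;
        }
        // Assumption: deleting is considered safe only when a CLOSED
        // replica exists, since it cannot be behind the unhealthy one.
        if (replicas.contains(ReplicaState.CLOSED)) {
            return Action.DELETE_UNHEALTHY;
        }
        // Proposal: with only quasi-closed (or other non-closed) replicas,
        // keep the unhealthy replica because it may be ahead of them.
        return Action.KEEP_UNHEALTHY;
    }

    public static void main(String[] args) {
        System.out.println(decide(List.of(ReplicaState.UNHEALTHY)));
        System.out.println(
            decide(List.of(ReplicaState.QUASI_CLOSED, ReplicaState.UNHEALTHY)));
    }
}
```

Defaulting to `KEEP_UNHEALTHY` reflects the issue's point that unhealthy replicas might still be recoverable; a real implementation would also need to compare bcsId values, which this sketch leaves out.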