[
https://issues.apache.org/jira/browse/HDDS-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17904968#comment-17904968
]
Ethan Rose commented on HDDS-11844:
-----------------------------------
bq. The actual reason for the Pipeline Safemode rule is to track
{{OPEN/CLOSING}} Containers. The {{OPEN/CLOSING}} {{Containers}} are managed by
{{Pipelines}}, the SCM doesn't really manage the replicas for {{OPEN/CLOSING}}
Containers. It relies on Ratis {{Pipeline}} to manage the replicas.
The existence of a pipeline does not guarantee the existence of a container.
Pipeline metadata is stored on a different volume from the container. Both
should be checked separately, despite what [this
comment|https://github.com/apache/ozone/blob/cba23a5f66a4fb7345db801cb1c8e2538146d4c4/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/OneReplicaPipelineSafeModeRule.java#L46]
in the current code implies. Once SCM is running, the replication manager needs
to handle open and closing containers that lose replicas, and safemode should
operate in a similar manner.
bq. There can be containers in {{OPEN/CLOSING}} state in SCM which were never
created by the client on the Datanodes. If we include Containers in
{{OPEN/CLOSING}} state in Container Safemode rule, SCM might never come out of
Safemode
The replication manager handles this by checking the last known container key
count since [https://github.com/apache/ozone/pull/5523]. We probably need
something similar for safemode, where the container safemode rule counts
open/closing containers that are known to be non-empty.
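A rough sketch of what such a filter could look like, assuming SCM keeps a last known key count for each container. The enum, class, and method names below are hypothetical stand-ins for illustration, not the existing safemode or replication manager classes:

```java
import java.util.List;

// Hypothetical, simplified stand-ins for illustration; not the real SCM classes.
enum LifeCycleState { OPEN, CLOSING, QUASI_CLOSED, CLOSED }

final class ContainerRecord {
  final long id;
  final LifeCycleState state;
  final long lastKnownKeyCount; // key count SCM last recorded for this container

  ContainerRecord(long id, LifeCycleState state, long lastKnownKeyCount) {
    this.id = id;
    this.state = state;
    this.lastKnownKeyCount = lastKnownKeyCount;
  }
}

final class OpenContainerSafemodeFilter {
  private OpenContainerSafemodeFilter() { }

  // True if safemode should wait for a replica report of this open/closing container.
  static boolean shouldWaitFor(ContainerRecord c) {
    boolean openOrClosing =
        c.state == LifeCycleState.OPEN || c.state == LifeCycleState.CLOSING;
    // A zero key count may mean the client never created the container on any
    // datanode, so waiting for it could keep SCM in safemode indefinitely.
    return openOrClosing && c.lastKnownKeyCount > 0;
  }

  // Number of open/closing containers the rule would wait for.
  static long countAwaited(List<ContainerRecord> containers) {
    return containers.stream()
        .filter(OpenContainerSafemodeFilter::shouldWaitFor)
        .count();
  }
}
```

With this filter, an {{OPEN}} container with a key count of zero is skipped, so a container that was allocated in SCM but never created on a datanode would not hold up safemode exit.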
I'm still unsure what the expected behavior is when there is widespread
metadata volume failure across datanodes and we lose much of the pipeline
information SCM is expecting to see. If the cluster is restarted with new
metadata volumes, it should not require a force exit of safemode.
> Do not wait for all the Pipelines to be reported to exit SafeMode
> -----------------------------------------------------------------
>
> Key: HDDS-11844
> URL: https://issues.apache.org/jira/browse/HDDS-11844
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Nandakumar
> Priority: Major
>
> We don't have to wait for all the Pipelines to be reported to exit
> {{SafeMode}}. Having at least one open {{Pipeline}} to serve writes is enough
> to get out of {{SafeMode}}.
> We can reuse the {{Pipelines}} reported by {{Datanodes}}, but we don't have
> to wait for all the {{Pipelines}} to be reported to get SCM out of
> {{SafeMode}}.