Sergey Soldatov created HDDS-14674:
--------------------------------------
Summary: Node with existing QUASI_CLOSED replica can be wrongly
selected as replication target
Key: HDDS-14674
URL: https://issues.apache.org/jira/browse/HDDS-14674
Project: Apache Ozone
Issue Type: Bug
Components: SCM
Affects Versions: 2.1.0
Reporter: Sergey Soldatov
Assignee: Sergey Soldatov
During RATIS under-replication handling (vulnerable/unhealthy path), SCM can
lose visibility of some existing replicas before target selection. As a result,
a DN that already has a replica of the same container may be incorrectly
considered eligible as a new target.
Why it happens:
In RatisUnderReplicationHandler.processAndSendCommands(...) we create two
counters:
withUnhealthy = new RatisContainerReplicaCount(containerInfo, replicas,
pendingOps, ..., true)
withoutUnhealthy = new RatisContainerReplicaCount(containerInfo, replicas,
pendingOps, ..., false)
If there are vulnerable/unhealthy replicas, we call
{*}handleVulnerableUnhealthyReplicas{*}(withUnhealthy, pendingOps).
Inside, it calls withUnhealthy.{*}getVulnerableUnhealthyReplicas{*}(...), which
mutates the internal field *replicas* via replicas.removeIf(...).
So the *withUnhealthy* object now has a modified internal replica list.
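
To illustrate the side effect, here is a minimal sketch. ReplicaCountSketch, its method names, and the filter predicate are hypothetical stand-ins, not the real RatisContainerReplicaCount code. It only shows how a query method that filters the internal list in place changes what a later getReplicas() call returns, and how filtering a copy avoids that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a replica-count object; NOT the real Ozone class.
class ReplicaCountSketch {
  private final List<String> replicas;

  ReplicaCountSketch(List<String> replicas) {
    this.replicas = new ArrayList<>(replicas);
  }

  // Mutating variant: filters the internal list in place via removeIf.
  // The predicate (keep only ":QUASI_CLOSED" entries) is an assumption for illustration.
  List<String> vulnerableMutating() {
    replicas.removeIf(r -> !r.endsWith(":QUASI_CLOSED"));
    return new ArrayList<>(replicas);
  }

  // Side-effect-free variant: applies the same filter to a copy.
  List<String> vulnerableCopy() {
    List<String> copy = new ArrayList<>(replicas);
    copy.removeIf(r -> !r.endsWith(":QUASI_CLOSED"));
    return copy;
  }

  List<String> getReplicas() {
    return new ArrayList<>(replicas);
  }

  public static void main(String[] args) {
    List<String> input =
        Arrays.asList("dn1:QUASI_CLOSED", "dn2:UNHEALTHY", "dn3:UNHEALTHY");

    ReplicaCountSketch mutated = new ReplicaCountSketch(input);
    mutated.vulnerableMutating();
    System.out.println(mutated.getReplicas().size()); // prints 1: dn2 and dn3 are gone

    ReplicaCountSketch intact = new ReplicaCountSketch(input);
    intact.vulnerableCopy();
    System.out.println(intact.getReplicas().size()); // prints 3: internal list untouched
  }
}
```

Filtering a copy is one possible direction; passing a snapshot of the replicas taken before the call would be another.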
After that, we call
replicateEachSource({*}withUnhealthy{*}, vulnerableUnhealthy, pendingOps)
where we do the following:
*allReplicas* = {*}withUnhealthy{*}.getReplicas()
ReplicationManagerUtil.getExcludedAndUsedNodes(container,
{*}allReplicas{*}, ...)
As a result, some existing replica hosts (non-healthy/stale ones) may be
missing from placement inputs. This can allow a DN that already hosts a replica
to be considered as a replication target.
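
The effect on target selection can be sketched as follows. TargetSelectionSketch and chooseTarget are hypothetical and do not reflect the real placement policy or ReplicationManagerUtil.getExcludedAndUsedNodes; the sketch only assumes that nodes already hosting a replica are excluded from the candidate set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of "exclude used nodes, then pick a target".
class TargetSelectionSketch {

  // Returns the first candidate that is not already known to host a replica.
  static Optional<String> chooseTarget(List<String> candidates,
                                       List<String> replicaHosts) {
    Set<String> used = new HashSet<>(replicaHosts);
    return candidates.stream().filter(dn -> !used.contains(dn)).findFirst();
  }

  public static void main(String[] args) {
    List<String> candidates = Arrays.asList("dn1", "dn2", "dn3", "dn4");

    // Full replica view: dn3 hosts a replica, so it is excluded.
    List<String> fullHosts = Arrays.asList("dn1", "dn2", "dn3");
    System.out.println(chooseTarget(candidates, fullHosts).get()); // prints dn4

    // Truncated view (dn3 was filtered out earlier): dn3 looks free.
    List<String> truncatedHosts = Arrays.asList("dn1", "dn2");
    System.out.println(chooseTarget(candidates, truncatedHosts).get()); // prints dn3
  }
}
```

With the truncated list, dn3 is offered as a target even though it already holds a replica of the container.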