[jira] [Updated] (HDDS-14674) Node with existing QUASI_CLOSED replica can be wrongly selected as replication target

Sergey Soldatov (Jira) Thu, 19 Feb 2026 10:57:15 -0800


     [ https://issues.apache.org/jira/browse/HDDS-14674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Sergey Soldatov updated HDDS-14674:
-----------------------------------
    Description: 
During RATIS under-replication handling (vulnerable/unhealthy path), SCM can 
lose visibility of some existing replicas before target selection. As a result, 
a DN that already has a replica of the same container may be incorrectly 
considered eligible as a new target.

Why it happens:

In RatisUnderReplicationHandler.processAndSendCommands(...) we create two 
counters:
withUnhealthy = new RatisContainerReplicaCount(containerInfo, replicas, 
pendingOps, ..., true)
withoutUnhealthy = new RatisContainerReplicaCount(containerInfo, replicas, 
pendingOps, ..., false)

If there are vulnerable/unhealthy replicas, we call
handleVulnerableUnhealthyReplicas(withUnhealthy, pendingOps).

Inside it, we call withUnhealthy.getVulnerableUnhealthyReplicas(...), 
which mutates the internal field replicas via replicas.removeIf(...).

So the withUnhealthy object now has a modified internal replica list.
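
The effect of a getter that filters its own backing list can be sketched in 
isolation. The example below is only an illustration: ReplicaCount and Replica 
are hypothetical stand-ins (not the real RatisContainerReplicaCount or 
ContainerReplica), and the "keep only UNHEALTHY" predicate is an assumption 
chosen for the demo, not the actual filter used in Ozone.

```java
import java.util.ArrayList;
import java.util.List;

public class MutatingGetterSketch {

    // Hypothetical stand-in for a container replica: datanode name plus state.
    static final class Replica {
        final String datanode;
        final String state;

        Replica(String datanode, String state) {
            this.datanode = datanode;
            this.state = state;
        }
    }

    // Hypothetical stand-in for the replica counter; not the real Ozone class.
    static final class ReplicaCount {
        private final List<Replica> replicas;

        ReplicaCount(List<Replica> replicas) {
            this.replicas = new ArrayList<>(replicas);
        }

        // Problematic pattern: filters the internal list in place.
        // The predicate is an assumption made for this illustration.
        List<Replica> getVulnerableUnhealthyReplicas() {
            replicas.removeIf(r -> !"UNHEALTHY".equals(r.state));
            return replicas;
        }

        List<Replica> getReplicas() {
            return replicas;
        }
    }

    public static void main(String[] args) {
        ReplicaCount withUnhealthy = new ReplicaCount(List.of(
            new Replica("dn1", "UNHEALTHY"),
            new Replica("dn2", "QUASI_CLOSED")));

        System.out.println("before: " + withUnhealthy.getReplicas().size());
        withUnhealthy.getVulnerableUnhealthyReplicas();
        // The QUASI_CLOSED replica on dn2 is no longer visible to later callers.
        System.out.println("after:  " + withUnhealthy.getReplicas().size());
    }
}
```

In this sketch, any caller that reads getReplicas() after the filtering call 
sees one replica instead of two, which mirrors the loss of visibility 
described above.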

After that, we call

replicateEachSource(withUnhealthy, vulnerableUnhealthy, pendingOps)

where we do the following:
     allReplicas = withUnhealthy.getReplicas()
     ReplicationManagerUtil.getExcludedAndUsedNodes(container, 
allReplicas, ...)

As a result, some existing replica hosts (non-healthy/stale ones) may be 
missing from placement inputs. This can allow a DN that already hosts a replica 
to be considered as a replication target.
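
To make the consequence concrete, here is a small sketch of target selection 
fed with a truncated replica list versus the full one. Everything in it is 
hypothetical: Replica, usedNodes, eligibleTargets and the "UNHEALTHY" 
predicate are simplified stand-ins invented for the example, not the real 
ReplicationManagerUtil or placement policy API. The non-mutating filter shown 
is one possible way to avoid the problem, not necessarily the fix adopted for 
this issue.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class PlacementExclusionSketch {

    // Hypothetical stand-in for a container replica.
    static final class Replica {
        final String datanode;
        final String state;

        Replica(String datanode, String state) {
            this.datanode = datanode;
            this.state = state;
        }
    }

    // Nodes that already host a replica and must not be picked as targets.
    static Set<String> usedNodes(List<Replica> replicas) {
        return replicas.stream().map(r -> r.datanode).collect(Collectors.toSet());
    }

    // Simplified placement: every candidate that is not already used.
    static List<String> eligibleTargets(List<String> candidates, Set<String> used) {
        return candidates.stream()
            .filter(dn -> !used.contains(dn))
            .collect(Collectors.toList());
    }

    // Non-mutating filter: returns a new list and leaves the input untouched.
    // The predicate is an assumption made for this illustration.
    static List<Replica> vulnerableUnhealthy(List<Replica> replicas) {
        return replicas.stream()
            .filter(r -> "UNHEALTHY".equals(r.state))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Replica> all = List.of(
            new Replica("dn1", "UNHEALTHY"),
            new Replica("dn2", "QUASI_CLOSED"));
        List<String> candidates = List.of("dn1", "dn2", "dn3");

        // Placement input built from a filtered list: dn2 looks free.
        List<Replica> truncated = vulnerableUnhealthy(all);
        System.out.println(eligibleTargets(candidates, usedNodes(truncated))); // [dn2, dn3]

        // Placement input built from the full list: only dn3 is eligible.
        System.out.println(eligibleTargets(candidates, usedNodes(all))); // [dn3]
    }
}
```

Because the filter here returns a copy, the full list stays available for 
computing excluded and used nodes, so dn2 is never offered as a target.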

 


> Node with existing QUASI_CLOSED replica can be wrongly selected as 
> replication target
> -------------------------------------------------------------------------------------
>
>                 Key: HDDS-14674
>                 URL: https://issues.apache.org/jira/browse/HDDS-14674
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>    Affects Versions: 2.1.0
>            Reporter: Sergey Soldatov
>            Assignee: Sergey Soldatov
>            Priority: Major



