[ 
https://issues.apache.org/jira/browse/HDDS-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chi-Hsuan Huang updated HDDS-15327:
-----------------------------------
    Status: Patch Available  (was: Open)

> SCM does not proactively clear failed replication tasks
> -------------------------------------------------------
>
>                 Key: HDDS-15327
>                 URL: https://issues.apache.org/jira/browse/HDDS-15327
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Wei-Chiu Chuang
>            Assignee: Chi-Hsuan Huang
>            Priority: Major
>              Labels: pull-request-available
>
> Suppose SCM is configured to allow up to 200 replications
> (hdds.datanode.replication.outofservice.limit.factor = 2,
> hdds.scm.replication.datanode.replication.limit = 100), but the Datanode
> only ever sees at most 12 replication tasks (in-flight + queued).
>  
> As a result, SCM does not push replication as hard as configured, and
> decommissioning becomes slow.
>  
> The reason SCM thinks 200 commands are active while the Datanode only sees
> 12 is that SCM does not proactively clear failed replication commands.
>    * Accounting Logic: When SCM sends a replication command, it increments
>      its "in-flight" count. It only decrements this count if it receives a
>      successful Container Report from the Datanode OR if the command times
>      out.
>    * The Leak: If a command fails on the Datanode (e.g., due to a temporary
>      network blip, or the DN being busy), the Datanode sends a failure
>      report. However, SCM's CommandStatusReportHandler currently ignores
>      replication failure reports.
>    * The Timeout: Those failed commands stay in SCM's "in-flight" quota for
>      12 minutes (the default for hdds.scm.replication.event.timeout).
>    * Decommissioning Impact: Because decommissioning triggers a burst of
>      thousands of commands, any small percentage of failures quickly "leaks"
>      and fills up the 200-command quota with stale entries that won't
>      disappear for 12 minutes, blocking new progress.
>  
> A possible workaround is to reduce hdds.scm.replication.event.timeout
> aggressively, e.g. down to 1m.
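
The accounting described above can be illustrated with a toy model. This is
only a sketch and is not Ozone code: the class InflightQuotaSketch, its
methods, and the clearOnFailure flag are hypothetical names invented for
illustration. It assumes a quota of 200 (2 x 100) and a timeout given in
seconds (12 minutes = 720 s), and it compares ignoring failure reports with
clearing the entry when a failure is reported.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of an in-flight replication quota. Hypothetical; not Ozone code. */
class InflightQuotaSketch {
    private final int limit;              // max commands counted as in-flight
    private final long timeoutSeconds;    // stand-in for the event timeout
    private final boolean clearOnFailure; // hypothetical: drop entry on failure report
    private final Map<Long, Long> sentAt = new HashMap<>(); // command id -> send time (s)
    private long nextId = 0;

    InflightQuotaSketch(int limit, long timeoutSeconds, boolean clearOnFailure) {
        this.limit = limit;
        this.timeoutSeconds = timeoutSeconds;
        this.clearOnFailure = clearOnFailure;
    }

    /** Drops entries whose age has reached the timeout. */
    private void expire(long now) {
        sentAt.values().removeIf(sent -> now - sent >= timeoutSeconds);
    }

    /** Returns a new command id, or -1 if the quota is full at time 'now'. */
    long trySend(long now) {
        expire(now);
        if (sentAt.size() >= limit) {
            return -1;
        }
        long id = nextId++;
        sentAt.put(id, now);
        return id;
    }

    /** A successful report always frees the slot. */
    void onSuccess(long id) {
        sentAt.remove(id);
    }

    /** A failure report frees the slot only if clearOnFailure is set. */
    void onFailure(long id) {
        if (clearOnFailure) {
            sentAt.remove(id);
        }
    }

    int inflightCount(long now) {
        expire(now);
        return sentAt.size();
    }

    /** Fills the quota at t=0, fails every command, then probes at probeTime. */
    static boolean canSendAfterFailures(int limit, long timeoutSeconds,
                                        boolean clearOnFailure, long probeTime) {
        InflightQuotaSketch scm =
            new InflightQuotaSketch(limit, timeoutSeconds, clearOnFailure);
        for (int i = 0; i < limit; i++) {
            scm.onFailure(scm.trySend(0));
        }
        return scm.trySend(probeTime) != -1;
    }

    public static void main(String[] args) {
        int limit = 2 * 100; // limit factor x per-datanode limit, as in the description
        System.out.println("ignore failures, 12m timeout, probe at 60s: "
            + canSendAfterFailures(limit, 720, false, 60));
        System.out.println("clear on failure, 12m timeout, probe at 60s: "
            + canSendAfterFailures(limit, 720, true, 60));
        System.out.println("ignore failures, 1m timeout, probe at 60s: "
            + canSendAfterFailures(limit, 60, false, 60));
    }
}
```

In this model, a shorter timeout only reduces how long a stale entry holds a
slot, whereas clearing on a failure report frees the slot immediately. Whether
the actual patch takes that approach is not stated in this message.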



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
