[
https://issues.apache.org/jira/browse/HDDS-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chi-Hsuan Huang updated HDDS-15327:
-----------------------------------
Status: Patch Available (was: Open)
> SCM does not proactively clear failed replication tasks
> -------------------------------------------------------
>
> Key: HDDS-15327
> URL: https://issues.apache.org/jira/browse/HDDS-15327
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM
> Reporter: Wei-Chiu Chuang
> Assignee: Chi-Hsuan Huang
> Priority: Major
> Labels: pull-request-available
>
> Let's say SCM is configured to allow up to 200 replications
> (hdds.datanode.replication.outofservice.limit.factor = 2,
> hdds.scm.replication.datanode.replication.limit = 100), but the Datanode
> sees at most 12 replication tasks (in-flight + queued).
>
> The result: SCM does not push replication as hard as configured, and
> decommission becomes slow.
>
> The reason SCM thinks 200 commands are active while the Datanode only sees
> 12 is that SCM does not proactively clear failed replication commands.
> * Accounting Logic: When SCM sends a replication command, it increments
> its "in-flight" count. It only decrements this count if it receives a
> successful Container Report from the Datanode OR if the command times out.
> * The Leak: If a command fails on the Datanode (e.g., due to a temporary
> network blip, or the DN being busy), the Datanode sends a failure report.
> However, SCM's CommandStatusReportHandler currently ignores replication
> failure reports.
> * The Timeout: Those failed commands stay in SCM's "in-flight" quota for
> 12 minutes (the default for hdds.scm.replication.event.timeout).
> * Decommissioning Impact: Because decommissioning triggers a burst of
> thousands of commands, even a small percentage of failures quickly "leaks"
> and fills up the 200-command quota with stale entries that won't disappear
> for 12 minutes, blocking new progress.
>
> A possible workaround is to reduce hdds.scm.replication.event.timeout
> aggressively, e.g. down to 1m.
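
The accounting leak described in the issue can be illustrated with a toy model. This is only a sketch under stated assumptions: InflightTracker, simulate, and the clear_on_failure flag are hypothetical names made up for illustration and are not Ozone classes or settings. The model assumes every command fails immediately on the Datanode, and uses the 200-command quota and 12-minute timeout quoted in the description.

```python
from dataclasses import dataclass, field


@dataclass
class InflightTracker:
    """Toy model of SCM-side in-flight accounting (hypothetical, not Ozone code)."""
    limit: int              # maximum number of in-flight commands allowed
    timeout_s: float        # stands in for hdds.scm.replication.event.timeout
    clear_on_failure: bool  # False models the behaviour described in the issue
    _sent_at: dict = field(default_factory=dict)  # command id -> send time (s)

    def expire(self, now):
        """Drop entries whose age has reached the timeout."""
        self._sent_at = {
            cmd: sent for cmd, sent in self._sent_at.items()
            if now - sent < self.timeout_s
        }

    def try_send(self, cmd_id, now):
        """Record a new command if the quota allows it; return True if sent."""
        self.expire(now)
        if len(self._sent_at) >= self.limit:
            return False
        self._sent_at[cmd_id] = now
        return True

    def on_success(self, cmd_id):
        """A successful report frees the slot."""
        self._sent_at.pop(cmd_id, None)

    def on_failure(self, cmd_id):
        """A failure report frees the slot only if failures are handled."""
        if self.clear_on_failure:
            self._sent_at.pop(cmd_id, None)

    def inflight(self, now):
        self.expire(now)
        return len(self._sent_at)


def simulate(clear_on_failure, limit=200, timeout_s=12 * 60):
    """Fill the quota at t=0 with commands that all fail, then probe at t=60s."""
    tracker = InflightTracker(limit, timeout_s, clear_on_failure)
    for cmd_id in range(limit):
        assert tracker.try_send(cmd_id, now=0.0)
        tracker.on_failure(cmd_id)  # assumption: every command fails at once
    stuck = tracker.inflight(now=60.0)
    can_send = tracker.try_send("new", now=60.0)
    return stuck, can_send


if __name__ == "__main__":
    print("ignore failures:", simulate(clear_on_failure=False))
    print("clear on failure:", simulate(clear_on_failure=True))
```

In this model, ignoring failure reports leaves all 200 slots occupied one minute later, so no new command can be sent until the timeout expires. Clearing the slot when a failure report arrives frees the quota immediately.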