Wei-Chiu Chuang created HDDS-15327:
--------------------------------------

             Summary: SCM does not proactively clear failed replication tasks
                 Key: HDDS-15327
                 URL: https://issues.apache.org/jira/browse/HDDS-15327
             Project: Apache Ozone
          Issue Type: Bug
          Components: SCM
            Reporter: Wei-Chiu Chuang


Suppose SCM is configured to allow up to 200 replication commands
(hdds.datanode.replication.outofservice.limit.factor = 2,
hdds.scm.replication.datanode.replication.limit = 100), but the Datanode sees
at most 12 replication tasks (in-flight + queued).

 

SCM thinks 200 commands are active while the Datanode sees only 12 because
SCM does not proactively clear failed replication commands.
   * Accounting logic: When SCM sends a replication command, it increments its
     "in-flight" count. It decrements this count only if it receives a
     successful Container Report from the Datanode or if the command times out.
   * The leak: If a command fails on the Datanode (e.g., due to a temporary
     network blip, or the DN being busy), the Datanode sends a failure report.
     However, SCM's CommandStatusReportHandler currently ignores replication
     failure reports.
   * The timeout: Those failed commands stay in SCM's "in-flight" quota for 12
     minutes (the default for hdds.scm.replication.event.timeout).
   * Decommissioning impact: Because decommissioning triggers a burst of
     thousands of commands, even a small percentage of failures quickly "leaks"
     and fills up the 200-command quota with stale entries that won't disappear
     for 12 minutes, blocking new progress.
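
The accounting described in the list above can be illustrated with a small toy
model. This is only a sketch, not Ozone code: InflightTracker, its methods, and
the clear_on_failure switch are hypothetical names invented for illustration.
The limit of 200 and the 12-minute timeout are taken from the description.

```python
from dataclasses import dataclass, field


@dataclass
class InflightTracker:
    """Toy model of SCM-side in-flight accounting (hypothetical, not Ozone code)."""
    limit: int              # in-flight quota, e.g. 2 * 100 = 200
    timeout_s: float        # event timeout in seconds, e.g. 12 minutes
    clear_on_failure: bool  # False models the behaviour described in this issue
    _sent_at: dict = field(default_factory=dict)  # command id -> send time

    def _expire(self, now):
        # Drop entries whose timeout has elapsed.
        expired = [c for c, t in self._sent_at.items() if now - t >= self.timeout_s]
        for c in expired:
            del self._sent_at[c]

    def try_send(self, cmd_id, now):
        # A command is only sent if the quota has room after expiry.
        self._expire(now)
        if len(self._sent_at) >= self.limit:
            return False
        self._sent_at[cmd_id] = now
        return True

    def on_success(self, cmd_id):
        # A successful report frees the slot immediately.
        self._sent_at.pop(cmd_id, None)

    def on_failure(self, cmd_id):
        # With clear_on_failure=False the failure report is ignored,
        # so the slot stays occupied until the timeout.
        if self.clear_on_failure:
            self._sent_at.pop(cmd_id, None)

    @property
    def inflight(self):
        return len(self._sent_at)


def demo(clear_on_failure):
    """Fill the quota, fail everything, then try to send more commands."""
    tracker = InflightTracker(limit=200, timeout_s=12 * 60,
                              clear_on_failure=clear_on_failure)
    for i in range(200):
        tracker.try_send(i, now=0.0)
    for i in range(200):
        tracker.on_failure(i)
    accepted_soon = tracker.try_send("next", now=5.0)         # 5 s later
    accepted_late = tracker.try_send("later", now=12 * 60.0)  # after the timeout
    return accepted_soon, accepted_late
```

With clear_on_failure=False, the command sent 5 seconds after the failures is
rejected and only the one sent after the timeout is accepted. With
clear_on_failure=True, both are accepted.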

 

A possible workaround is to reduce hdds.scm.replication.event.timeout
aggressively, e.g. down to 1m.
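
As a rough way to see why a shorter timeout helps, the number of stale entries
can be estimated as failure rate times timeout, capped at the quota. This is a
back-of-the-envelope sketch that assumes a constant failure rate. The function
names and the rate of 20 failures per minute are hypothetical, not measurements.

```python
def stale_entries(failures_per_min, timeout_min, limit):
    """Estimated stale in-flight entries, assuming a constant failure rate."""
    return min(limit, failures_per_min * timeout_min)


def free_slots(failures_per_min, timeout_min, limit):
    """Quota left for new commands under the same assumption."""
    return limit - stale_entries(failures_per_min, timeout_min, limit)


# Hypothetical rate of 20 failed commands per minute against a quota of 200:
quota_left_12m = free_slots(20, 12, 200)  # 12m timeout
quota_left_1m = free_slots(20, 1, 200)    # 1m timeout
```

Under these assumed numbers the 12m timeout leaves no free slots, while the 1m
timeout leaves 180.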



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
