[PR] HDDS-15327. Proactively clear failed replication commands in SCM [ozone]

via GitHub Thu, 18 Jun 2026 04:06:26 -0700


chihsuan opened a new pull request, #10540:
URL: https://github.com/apache/ozone/pull/10540


   ## What changes were proposed in this pull request?
   
   When a replication or EC reconstruction command fails on a datanode (transient
   network issue, busy datanode, etc.), SCM is never told. The pending "ADD"
   operation stays in `ContainerReplicaPendingOps` and continues to count against
   the inflight replication accounting until the command's deadline expires; the
   deadline is set by `hdds.scm.replication.event.timeout`, which defaults to
   12 minutes.
   
   This has two effects:
   
   1. The cluster-wide inflight count (`ReplicationManager#getInflightReplicationCount`,
      gated by `UnderReplicatedProcessor`) fills up with stale entries, so SCM stops
      scheduling new replication even though the datanodes are idle.
   2. The specific under-replicated container is not re-scheduled, because the health
      check still sees a pending ADD for that replica.
   
   During decommission this is especially painful: thousands of commands are issued,
   and a small failure rate quickly leaks enough stale entries to stall progress for
   up to 12 minutes at a time.
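
The stall described above can be illustrated with a toy model. This is a minimal sketch, not Ozone code: `InflightGateSketch`, its fixed `limit`, and the millisecond clock passed in by the caller are all hypothetical, and the real limit in `ReplicationManager` is computed differently. It only shows why entries that leave solely on expiry block scheduling for the whole timeout.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of an inflight limit whose entries leave only on expiry. */
class InflightGateSketch {
  private final int limit;
  private final long timeoutMillis;
  // Deadlines in insertion order. With a fixed timeout and a non-decreasing
  // clock (an assumption of this sketch) this is also expiry order.
  private final Deque<Long> deadlines = new ArrayDeque<>();

  InflightGateSketch(int limit, long timeoutMillis) {
    this.limit = limit;
    this.timeoutMillis = timeoutMillis;
  }

  /** Returns true if a new command could be scheduled at the given time. */
  boolean trySchedule(long nowMillis) {
    expire(nowMillis);
    if (deadlines.size() >= limit) {
      // Failed commands still occupy slots here, because nothing removed them.
      return false;
    }
    deadlines.addLast(nowMillis + timeoutMillis);
    return true;
  }

  /** Number of entries still counted as inflight at the given time. */
  int inflight(long nowMillis) {
    expire(nowMillis);
    return deadlines.size();
  }

  // Drop every entry whose deadline has passed.
  private void expire(long nowMillis) {
    while (!deadlines.isEmpty() && deadlines.peekFirst() <= nowMillis) {
      deadlines.pollFirst();
    }
  }
}
```

With a limit of 2 and a 12-minute timeout, two commands issued at time 0 keep the gate closed until minute 12 even if both failed within seconds, which is the behaviour this PR aims to avoid.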
   
   This PR makes SCM clear a failed replication/reconstruction op proactively, by
   re-introducing the command-status feedback path that was removed in HDDS-1368,
   **without any Protobuf/wire change** (the `CommandStatus` message already carries
   `FAILED`, `cmdId`, and `type`):
   
   - **Datanode** now reports `EXECUTED`/`FAILED` status for `replicateContainerCommand`
     and `reconstructECContainersCommand`, mirroring how `deleteBlocksCommand` already
     reports. `StateContext#addCmdStatus` registers a PENDING entry for these commands,
     `AbstractReplicationTask#getCommandId()` exposes the backing command id, and
     `ReplicationSupervisor.TaskRunner` marks the status when the task finishes. Tasks
     with no backing SCM command (e.g. reconcile) are unaffected.
   - **SCM** routes failed statuses to the pending-op store: `CommandStatusReportHandler`
     fires a new `REPLICATION_STATUS` event for failed replicate/reconstruct commands,
     and `StorageContainerManager` wires it to a new
     `ContainerReplicaPendingOps#onReplicationCommandFailed(cmdId)`, which looks the
     command up via a new `cmdId -> ContainerID` index and removes the matching ADD op
     (decrementing the inflight counter and freeing the scheduled size), so both effects
     above are resolved immediately instead of after the timeout.
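
The SCM-side lookup can be sketched as follows. This is a simplified, hypothetical stand-in rather than the actual `ContainerReplicaPendingOps`: the class name `PendingOpsSketch`, the `AddOp` record-like type, `scheduleAdd`, and the use of plain `long` ids are assumptions made for illustration. Only the idea of a `cmdId -> ContainerID` index and the method name `onReplicationCommandFailed` come from the description above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical pending-op store with a secondary index keyed by command id. */
class PendingOpsSketch {
  /** One pending ADD op; the fields are illustrative. */
  static final class AddOp {
    final long cmdId;
    final long scheduledBytes;

    AddOp(long cmdId, long scheduledBytes) {
      this.cmdId = cmdId;
      this.scheduledBytes = scheduledBytes;
    }
  }

  private final Map<Long, List<AddOp>> opsByContainer = new HashMap<>();
  // Secondary index: command id -> container id, so a failure report that
  // carries only the command id can find the op to remove.
  private final Map<Long, Long> containerByCmdId = new HashMap<>();
  private long inflightAdds;
  private long scheduledBytes;

  synchronized void scheduleAdd(long cmdId, long containerId, long bytes) {
    opsByContainer.computeIfAbsent(containerId, k -> new ArrayList<>())
        .add(new AddOp(cmdId, bytes));
    containerByCmdId.put(cmdId, containerId);
    inflightAdds++;
    scheduledBytes += bytes;
  }

  /** Removes the ADD op for a failed command; unknown ids are a no-op. */
  synchronized boolean onReplicationCommandFailed(long cmdId) {
    Long containerId = containerByCmdId.remove(cmdId);
    if (containerId == null) {
      return false;
    }
    List<AddOp> ops = opsByContainer.get(containerId);
    if (ops == null) {
      return false;
    }
    for (Iterator<AddOp> it = ops.iterator(); it.hasNext();) {
      AddOp op = it.next();
      if (op.cmdId == cmdId) {
        it.remove();
        inflightAdds--;
        scheduledBytes -= op.scheduledBytes;
        if (ops.isEmpty()) {
          opsByContainer.remove(containerId);
        }
        return true;
      }
    }
    return false;
  }

  synchronized boolean hasPendingAdd(long containerId) {
    return opsByContainer.containsKey(containerId);
  }

  synchronized long getInflightAdds() {
    return inflightAdds;
  }

  synchronized long getScheduledBytes() {
    return scheduledBytes;
  }
}
```

Removing the index entry first keeps a duplicate failure report for the same command harmless: the second call finds no mapping and returns without touching the counters.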
   
   Compatibility degrades gracefully: an old datanode against a new SCM simply never
   sends the failure report and falls back to the existing 12-minute timeout; a new
   datanode against an old SCM has its replication status ignored as before.
   
   Follow-ups (intentionally out of scope here):
   - A `MiniOzoneCluster` decommission integration test that induces replication
     failures and asserts quota recovery.
   - Reporting status on the `TaskRunner` early-return paths (deadline passed / not in
     service / stale term) so PENDING entries are reclaimed sooner; this matches the
     existing `deleteBlocksCommand` behaviour and is tracked separately.
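
The datanode-side status lifecycle, including the early-return gap named in the second follow-up, can be sketched roughly like this. It is an illustration under assumptions, not the code in this PR: `CommandStatusSketch`, `runTask`, and the `skip` flag are hypothetical stand-ins for `StateContext`, `ReplicationSupervisor.TaskRunner`, and its early-return checks.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of per-command status tracking on a datanode. */
class CommandStatusSketch {
  enum Status { PENDING, EXECUTED, FAILED }

  private final Map<Long, Status> statusByCmdId = new ConcurrentHashMap<>();

  /** Registers a PENDING entry when a command is received. */
  void addCmdStatus(long cmdId) {
    statusByCmdId.putIfAbsent(cmdId, Status.PENDING);
  }

  /**
   * Runs a task and records its outcome. A null cmdId models a task with no
   * backing SCM command; skip models an early return (for example a passed
   * deadline), which in this sketch leaves the entry PENDING.
   */
  void runTask(Long cmdId, boolean skip, Runnable work) {
    if (skip) {
      return;
    }
    Status result;
    try {
      work.run();
      result = Status.EXECUTED;
    } catch (RuntimeException e) {
      result = Status.FAILED;
    }
    if (cmdId != null) {
      // replace() only updates an existing entry, so unregistered ids are ignored.
      statusByCmdId.replace(cmdId, result);
    }
  }

  /** Returns the recorded status, or null if the command was never registered. */
  Status getStatus(long cmdId) {
    return statusByCmdId.get(cmdId);
  }
}
```

In this model a skipped task never leaves PENDING, which is why reporting on those paths is listed as separate work.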
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-15327
   
   ## How was this patch tested?
   
   New and updated unit tests:
   - `TestContainerReplicaPendingOps`: a failed command removes the matching ADD op and
     decrements the inflight counter; an unknown command id is a no-op.
   - `TestCommandStatusReportHandler`: a FAILED replication status fires `REPLICATION_STATUS`.
   - `TestStateContext`: replicate/reconstruct commands register a PENDING status.
   - `TestReplicationSupervisor`: a finished task reports `EXECUTED` on success and
     `FAILED` on failure.
   
   Local CI-aligned checks all pass: `checkstyle.sh`, `rat.sh`, `author.sh`.
   
   Generated-by: Claude Code (Claude Opus 4.8)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

