Stephen O'Donnell created HDDS-7618:
---------------------------------------

             Summary: Replication Commands should timeout if not processed on 
datanodes in time
                 Key: HDDS-7618
                 URL: https://issues.apache.org/jira/browse/HDDS-7618
             Project: Apache Ozone
          Issue Type: Improvement
          Components: Ozone Datanode, SCM
            Reporter: Stephen O'Donnell
            Assignee: Stephen O'Donnell


Both the new and the old replication manager (RM) send commands to the 
datanodes. If a command has not been processed on the datanode within the 
ReplicationManager event.timeout, RM assumes the command has failed for some 
reason and may send another one to the same or a different host.

It makes sense for the datanode to drop any unprocessed command slightly 
before ReplicationManager gives up on it. This matters especially for delete 
container commands: we don't want two or more deletes pending in the system 
for the same container when RM thinks there is only one.

To facilitate dropping the commands, we can add a deadline to all commands. 
For the commands we want to enforce a deadline on, SCM sets the deadline and 
the DN checks for it.

This change ensures that all commands sent to a datanode from RM have a 
deadline set to 0.9 * event.timeout. On the datanode side, the deadline is 
enforced only for ReplicationContainer, DeleteContainer and ECReconstruction 
commands.
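
As a rough illustration of the rule described above, here is a minimal Java 
sketch. It is not the actual Ozone code: the class, enum and method names are 
hypothetical, the deadline is assumed to be an absolute wall-clock time in 
epoch milliseconds, and a deadline of 0 is assumed to mean "no deadline". 
Only the 0.9 factor and the three enforced command types come from the 
description.

```java
import java.time.Duration;
import java.util.EnumSet;
import java.util.Set;

public class CommandDeadlineSketch {

  // Hypothetical command types; OTHER stands for any command whose
  // deadline the datanode does not enforce.
  public enum CommandType {
    REPLICATE_CONTAINER, DELETE_CONTAINER, EC_RECONSTRUCTION, OTHER
  }

  // The three command types the description says are enforced on the DN.
  static final Set<CommandType> ENFORCED = EnumSet.of(
      CommandType.REPLICATE_CONTAINER,
      CommandType.DELETE_CONTAINER,
      CommandType.EC_RECONSTRUCTION);

  // Fraction of event.timeout the datanode gets, per the description.
  static final double DEADLINE_FACTOR = 0.9;

  // SCM side: deadline = send time + 0.9 * event.timeout (epoch millis).
  public static long computeDeadline(long nowMillis, Duration eventTimeout) {
    return nowMillis + Math.round(eventTimeout.toMillis() * DEADLINE_FACTOR);
  }

  // DN side: drop only enforced command types whose deadline has passed.
  // Assumption: deadlineMillis == 0 means no deadline was set.
  public static boolean shouldDrop(CommandType type, long deadlineMillis,
      long nowMillis) {
    return ENFORCED.contains(type)
        && deadlineMillis > 0
        && nowMillis > deadlineMillis;
  }

  public static void main(String[] args) {
    long sentAt = System.currentTimeMillis();
    long deadline = computeDeadline(sentAt, Duration.ofMinutes(10));
    // Checked one millisecond after the deadline has passed.
    System.out.println(
        shouldDrop(CommandType.DELETE_CONTAINER, deadline, deadline + 1));
    System.out.println(
        shouldDrop(CommandType.OTHER, deadline, deadline + 1));
  }
}
```

Comparing an SCM-assigned absolute time against the datanode's clock assumes 
the two clocks are reasonably in sync; the 10% margin left by the 0.9 factor 
is what keeps the datanode's drop ahead of RM's own timeout.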



