Stephen O'Donnell created HDDS-7618:
---------------------------------------
Summary: Replication Commands should timeout if not processed on
datanodes in time
Key: HDDS-7618
URL: https://issues.apache.org/jira/browse/HDDS-7618
Project: Apache Ozone
Issue Type: Improvement
Components: Ozone Datanode, SCM
Reporter: Stephen O'Donnell
Assignee: Stephen O'Donnell
Both the new and the old replication manager send commands to the datanodes.
If a command has not been processed on the datanode within the replicationManager
event.timeout, RM assumes the command has failed for some reason, and may send
another one to the same or a different host.
It makes sense for the datanode to drop any command it has not processed
slightly before ReplicationManager gives up on it. This matters most for delete
container commands: we don't want two or more deletes pending in the system
for the same container when RM thinks there is only one.
To facilitate dropping the commands, we can add a deadline to all commands.
For the commands on which we want to enforce a deadline, SCM sets the deadline
and the DN checks it.
This change ensures that all commands sent to a datanode from RM have a
deadline set to 0.9 * event.timeout. On the datanode side, we only enforce the
deadline on ReplicationContainer, DeleteContainer and ECReconstruction commands.
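
As a rough illustration of the scheme described above, the sketch below computes
a deadline on the SCM side and checks it on the DN side. All names here
(CommandDeadlineSketch, CommandType, computeDeadline, shouldDrop) are
hypothetical and are not the actual Ozone classes. It also assumes the deadline
is stored as an absolute time in epoch milliseconds (send time plus 0.9 *
event.timeout) and that a value of 0 means "no deadline". The 10 minute timeout
is chosen only for the example.

```java
import java.time.Duration;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: illustrative names only, not the Ozone implementation.
final class CommandDeadlineSketch {

  // Command types named in this issue; the enum itself is an assumption.
  enum CommandType { REPLICATE_CONTAINER, DELETE_CONTAINER, EC_RECONSTRUCTION, OTHER }

  // Types on which the datanode enforces the deadline.
  static final Set<CommandType> ENFORCED = EnumSet.of(
      CommandType.REPLICATE_CONTAINER,
      CommandType.DELETE_CONTAINER,
      CommandType.EC_RECONSTRUCTION);

  // SCM side: deadline = send time + 0.9 * event.timeout, in epoch millis.
  static long computeDeadline(long nowMillis, Duration eventTimeout) {
    return nowMillis + eventTimeout.toMillis() * 9 / 10;
  }

  // DN side: drop only if the type is enforced, a deadline is set (> 0,
  // an assumed convention), and the deadline has already passed.
  static boolean shouldDrop(CommandType type, long deadlineMillis, long nowMillis) {
    return ENFORCED.contains(type) && deadlineMillis > 0 && nowMillis > deadlineMillis;
  }

  public static void main(String[] args) {
    long sentAt = 1_000_000L;
    long deadline = computeDeadline(sentAt, Duration.ofMinutes(10));

    // 10 minutes = 600000 ms, so the deadline is 540000 ms after sending.
    System.out.println(deadline - sentAt); // prints 540000

    // An expired delete is dropped; an expired command of another type is not.
    System.out.println(shouldDrop(CommandType.DELETE_CONTAINER, deadline, deadline + 1)); // prints true
    System.out.println(shouldDrop(CommandType.OTHER, deadline, deadline + 1)); // prints false
  }
}
```

With these assumptions, a command that is still queued on the datanode after
90% of the timeout has elapsed is discarded before RM decides to resend it,
which is what prevents duplicate deletes from piling up.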