Stephen O'Donnell created HDDS-8335:
---------------------------------------
Summary: ReplicationManager: Review unhealthy handlers to deal
with partial success and overloaded exceptions
Key: HDDS-8335
URL: https://issues.apache.org/jira/browse/HDDS-8335
Project: Apache Ozone
Issue Type: Sub-task
Components: SCM
Reporter: Stephen O'Donnell
In RatisOverReplicationHandler and ECOverReplicationHandler, a container can be
over replicated by several replicas, and the deletes are done in two stages:
1. First unhealthy replicas are removed.
2. Then healthy replicas are removed.
While removing any replica, the handler could get a
CommandTargetOverloadedException, but rather than throwing that exception
immediately, it continues trying other replicas. At the end, if it has not
deleted enough replicas, it re-throws the first
CommandTargetOverloadedException so the over replication is re-queued on the
over replication queue.
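
The deferred re-throw described above can be sketched as follows. This is
only an illustration, not the handlers' actual code: Replica, Sender and
removeExcess() are hypothetical names, the exception class is a local
stand-in for Ozone's CommandTargetOverloadedException, and constraints the
real handlers must respect (placement policy, EC indexes) are not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class OverReplicationSketch {

  /** Stand-in for Ozone's exception of the same name; not the real class. */
  static class CommandTargetOverloadedException extends Exception {
    CommandTargetOverloadedException(String message) {
      super(message);
    }
  }

  /** Hypothetical replica model: a datanode name plus a health flag. */
  static final class Replica {
    final String node;
    final boolean healthy;

    Replica(String node, boolean healthy) {
      this.node = node;
      this.healthy = healthy;
    }
  }

  /** Hypothetical sender that rejects deletes aimed at overloaded nodes. */
  static final class Sender {
    final List<String> deleted = new ArrayList<>();
    private final Set<String> overloadedNodes;

    Sender(Set<String> overloadedNodes) {
      this.overloadedNodes = overloadedNodes;
    }

    void sendDelete(Replica replica) throws CommandTargetOverloadedException {
      if (overloadedNodes.contains(replica.node)) {
        throw new CommandTargetOverloadedException(
            replica.node + " is overloaded");
      }
      deleted.add(replica.node);
    }
  }

  /**
   * Sends up to 'excess' deletes, unhealthy replicas first. An overloaded
   * target is skipped and remembered; the first such exception is re-thrown
   * only if too few deletes could be sent overall.
   */
  static int removeExcess(List<Replica> replicas, int excess, Sender sender)
      throws CommandTargetOverloadedException {
    CommandTargetOverloadedException firstOverloaded = null;
    int sent = 0;
    // Stage 1 handles unhealthy replicas, stage 2 handles healthy ones.
    for (boolean healthyStage : new boolean[] {false, true}) {
      for (Replica replica : replicas) {
        if (sent >= excess) {
          return sent;
        }
        if (replica.healthy != healthyStage) {
          continue;
        }
        try {
          sender.sendDelete(replica);
          sent++;
        } catch (CommandTargetOverloadedException e) {
          // Keep going with the other replicas; remember only the first.
          if (firstOverloaded == null) {
            firstOverloaded = e;
          }
        }
      }
    }
    if (sent < excess && firstOverloaded != null) {
      // Partial success: the caller is expected to re-queue the container.
      throw firstOverloaded;
    }
    return sent;
  }
}
```

In this sketch the deletes that were sent before the exception is thrown
are not undone, so a re-queued container only needs the remaining replicas
removed on the next pass.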
Other handlers also have multiple stages, but in the event of an error like
CommandTargetOverloadedException, they give up immediately.
RatisOverReplicationHandler works as expected. So does ECOverReplicationHandler.
For RatisUnderReplicationHandler, the command target is the source, and
RM.sendThrottledReplicationCommand() picks the least loaded source. It is
possible to send one command and then fail to send the second, but there is
no point in retrying, as the failure means all the sources are overloaded. As
things stand, it sends what it can and then throws an exception, so that is
fine.
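
To illustrate why a retry cannot help here, the sketch below models source
selection under an assumed per-source limit. The internals of the real
method are not shown in this issue, so pickSource(), sendReplications(), the
queue-count map and the limit are all hypothetical, and the exception class
is a local stand-in for Ozone's.

```java
import java.util.Map;

public class SourceSelectionSketch {

  /** Stand-in for Ozone's exception of the same name; not the real class. */
  static class CommandTargetOverloadedException extends Exception {
    CommandTargetOverloadedException(String message) {
      super(message);
    }
  }

  /**
   * Returns the source with the fewest queued commands that is still under
   * the limit, or throws if every source is at or above the limit.
   */
  static String pickSource(Map<String, Integer> queuedBySource, int limit)
      throws CommandTargetOverloadedException {
    return queuedBySource.entrySet().stream()
        .filter(e -> e.getValue() < limit)
        .min(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey)
        .orElseThrow(() -> new CommandTargetOverloadedException(
            "all sources are at the limit of " + limit));
  }

  /**
   * Sends 'needed' replication commands, one at a time. Commands sent before
   * the exception stay recorded in the map, so a failure part way through
   * leaves a partial success behind.
   */
  static int sendReplications(Map<String, Integer> queuedBySource, int limit,
      int needed) throws CommandTargetOverloadedException {
    int sent = 0;
    while (sent < needed) {
      // Throws as soon as no source has capacity; trying again immediately
      // would see the same fully loaded sources.
      String source = pickSource(queuedBySource, limit);
      queuedBySource.merge(source, 1, Integer::sum);
      sent++;
    }
    return sent;
  }
}
```

Because the selection already considers every source, an exception here
means none of them has capacity, which matches the "send what it can, then
throw" behaviour described above.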
For MisReplicationHandler, which is currently shared by EC and Ratis
(HDDS-8109 may change this), I believe it could run into this problem with EC,
where it may need to make a new copy of 2 EC indexes, and one of the nodes is
overloaded while the other is not. It would be better not to fail completely
if the first is overloaded.
For Ratis mis-replication, as any replica can be copied after HDDS-8109, it
should then behave like the RatisUnderReplicationHandler.
For ECUnderReplicationHandler, there are multiple stages for processing and
potential for partial success.
We should review both ECUnderReplicationHandler and EC MisReplication handling
(after HDDS-8109) to handle overloaded exceptions and throw exceptions on
partial success.