Stephen O'Donnell created HDDS-8335:
---------------------------------------

             Summary: ReplicationManager: Review unhealthy handlers to deal 
with partial success and overloaded exceptions
                 Key: HDDS-8335
                 URL: https://issues.apache.org/jira/browse/HDDS-8335
             Project: Apache Ozone
          Issue Type: Sub-task
          Components: SCM
            Reporter: Stephen O'Donnell


In RatisOverReplicationHandler and ECOverReplicationHandler, a container can be 
over replicated by several replicas, and the deletes are done in two stages:

1. First, unhealthy replicas are removed.
2. Then, healthy replicas are removed.

While removing any replica, the handler could get a 
CommandTargetOverloadedException, but rather than throwing that exception 
immediately, it continues trying other replicas. At the end, if it has not 
deleted enough replicas, it re-throws the first 
CommandTargetOverloadedException so the over replication is re-queued on the 
over replication queue.
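
The deferred re-throw described above can be sketched roughly as follows. This
is a minimal illustration, not the Ozone implementation: the class
DeferredOverloadSketch, the exception TargetOverloadedException (a stand-in for
CommandTargetOverloadedException), the DeleteSender interface and the string
replica names are all hypothetical. The candidate list is assumed to be ordered
with unhealthy replicas first, which collapses the two stages into one loop.

```java
import java.util.ArrayList;
import java.util.List;

final class DeferredOverloadSketch {

  /** Hypothetical stand-in for CommandTargetOverloadedException. */
  static class TargetOverloadedException extends Exception {
    TargetOverloadedException(String message) {
      super(message);
    }
  }

  /** Hypothetical sender; throws if the node holding the replica is overloaded. */
  interface DeleteSender {
    void sendDelete(String replica) throws TargetOverloadedException;
  }

  /**
   * Tries to delete up to 'excess' replicas from 'candidates' (assumed to be
   * ordered unhealthy first, then healthy). An overloaded target does not
   * stop the loop; the first such exception is remembered and re-thrown only
   * if too few deletes were sent, so the caller can re-queue the container.
   */
  static List<String> removeExcess(List<String> candidates, int excess,
      DeleteSender sender) throws TargetOverloadedException {
    List<String> deleted = new ArrayList<>();
    TargetOverloadedException firstOverload = null;
    for (String replica : candidates) {
      if (deleted.size() >= excess) {
        break;
      }
      try {
        sender.sendDelete(replica);
        deleted.add(replica);
      } catch (TargetOverloadedException e) {
        // Remember only the first failure and keep trying other replicas.
        if (firstOverload == null) {
          firstOverload = e;
        }
      }
    }
    if (deleted.size() < excess && firstOverload != null) {
      // Partial success: some deletes may already have been sent.
      throw firstOverload;
    }
    return deleted;
  }

  public static void main(String[] args) {
    DeleteSender sender = replica -> {
      if (replica.equals("dn1")) {
        throw new TargetOverloadedException("dn1 is overloaded");
      }
    };
    try {
      System.out.println(removeExcess(List.of("dn1", "dn2", "dn3"), 2, sender));
    } catch (TargetOverloadedException e) {
      System.out.println("Re-queue: " + e.getMessage());
    }
  }
}
```

A handler that gives up immediately would instead let the exception propagate
out of the loop on the first overloaded target, leaving the remaining
candidates untried.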

Other handlers also have multiple stages, but in the event of an error like 
CommandTargetOverloadedException, they give up immediately.

RatisOverReplicationHandler works as expected. So does ECOverReplicationHandler.

For RatisUnderReplicationHandler, the command target is the source, and 
RM.sendThrottledReplicationCommand() picks the least loaded source. It is 
possible to send one command and then fail to send the second, but there is no 
point in retrying, because a failure means all the sources are overloaded. As 
things stand, it sends what it can and then throws an exception, so that is fine.

For MisReplicationHandler, which is currently shared between EC and Ratis 
(HDDS-8109 may change this), I believe it could run into this problem with EC. 
It may need to make a new copy of 2 EC indexes when 1 of the nodes is 
overloaded and the other is not. It would be better to still send the command 
for the second index rather than fail completely because the first is overloaded.

For Ratis mis-replication, as we can copy any replica after HDDS-8109, it should 
behave like the RatisUnderReplicationHandler.

For ECUnderReplicationHandler, there are multiple stages for processing and 
potential for partial success.

We should review both ECUnderReplicationHandler and EC MisReplication handling 
(after HDDS-8109) to handle overloaded exceptions and throw exceptions on 
partial success.


