Stephen O'Donnell created HDDS-8660:
---------------------------------------
Summary: ReplicationManager: Notify when nodes go dead or go
out of service
Key: HDDS-8660
URL: https://issues.apache.org/jira/browse/HDDS-8660
Project: Apache Ozone
Issue Type: Sub-task
Components: SCM
Reporter: Stephen O'Donnell
If someone triggers decommission / maintenance, there is potentially a 5 minute
lag between the decommission process starting and RM noticing that containers
need replication, because RM runs on a 5 minute interval. Similarly, by the time
a node is marked dead it has already been gone for 10 minutes, and it will take
up to another 5 minutes for RM to notice and process the containers.
It would be good to notify the RM thread to wake it up when these events happen
to reduce the time it takes to start to repair the problem.
One thing to keep in mind for any solution is that RM operates by:
1. Getting a list of all containers.
2. Processing the list.
3. Sleeping for 5 minutes.
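To make the wake-up idea concrete, here is a minimal sketch of an
interruptible version of step 3, using plain Java monitors. WakeableSleeper,
sleep and wakeUp are hypothetical names for illustration only and are not
taken from the Ozone code base. A wake-up that arrives while the thread is not
sleeping is deliberately dropped, which matches the "notify will not do
anything" behaviour.

```java
/** Hypothetical helper, not an existing Ozone class. */
final class WakeableSleeper {
  private final Object lock = new Object();
  private boolean waiting = false;   // true only while the loop is in step 3
  private boolean notified = false;  // set by wakeUp() to end the sleep early

  /** Sleeps up to timeoutMs. Returns true if woken early by wakeUp(). */
  boolean sleep(long timeoutMs) throws InterruptedException {
    synchronized (lock) {
      waiting = true;
      notified = false;
      long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
      try {
        // Loop to guard against spurious wake-ups from Object.wait().
        while (!notified) {
          long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
          if (remainingMs <= 0) {
            return false; // normal interval elapsed
          }
          lock.wait(remainingMs);
        }
        return true;
      } finally {
        waiting = false;
      }
    }
  }

  /**
   * Called from a node event (dead, decommission, maintenance).
   * Returns false, and does nothing, if the loop is busy in step 1 or 2.
   */
  boolean wakeUp() {
    synchronized (lock) {
      if (!waiting) {
        return false;
      }
      notified = true;
      lock.notifyAll();
      return true;
    }
  }
}
```

The RM thread would call sleep(300_000) after each pass over the container
list, and the node event handlers would call wakeUp().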
If a node goes dead during step 2 and we notify the thread, it will already be
running, so the notify will not do anything. It may be that some of the
containers from the node in question have been processed already, or they may
still be waiting to be processed - we don't really know. Perhaps this is OK,
rather than complicating the solution, as in general fixing decommission or
under-replication will take a long time.
It is also possible that several nodes go dead in quick succession, or several
nodes go out of service quickly, resulting in several notify calls occurring.
We don't want to wake up the thread too frequently if this happens, as it will
result in a new replication queue getting created over and over. Perhaps if the
queue is not empty, then there is replication work to do, and we should not run
again.
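A rough sketch of that guard, assuming the pending replication work is visible
as a queue. GuardedNotifier, onNodeEvent and the Runnable callback are
hypothetical names used only for illustration, and the real replication queue
in RM may look quite different.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical sketch, not an existing Ozone class. */
final class GuardedNotifier {
  // Stand-in for the replication queue, holding container IDs.
  private final Queue<Long> pendingReplication = new ConcurrentLinkedQueue<>();
  // Stand-in for whatever actually wakes the RM thread.
  private final Runnable wakeReplicationManager;

  GuardedNotifier(Runnable wakeReplicationManager) {
    this.wakeReplicationManager = wakeReplicationManager;
  }

  void enqueue(long containerId) {
    pendingReplication.add(containerId);
  }

  Long poll() {
    return pendingReplication.poll();
  }

  /**
   * Called for each node event. Wakes RM only when no replication work is
   * queued, so a burst of events does not rebuild the queue repeatedly.
   * Returns true if a wake-up was issued.
   */
  boolean onNodeEvent() {
    if (!pendingReplication.isEmpty()) {
      return false; // there is already work to do, skip this wake-up
    }
    wakeReplicationManager.run();
    return true;
  }
}
```

The check and the wake-up are not atomic here, so an occasional extra run is
still possible. That seems acceptable given the goal is only to avoid waking
the thread over and over.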
Finally, we might want to consider notifying on a node coming back into
service, as that could cause over-replication. However, over-replication is not
as big a problem as under-replication if it is not addressed quickly.