Stephen O'Donnell created HDDS-8660:
---------------------------------------

             Summary: ReplicationManager: Notify when dead nodes or nodes go 
out of service
                 Key: HDDS-8660
                 URL: https://issues.apache.org/jira/browse/HDDS-8660
             Project: Apache Ozone
          Issue Type: Sub-task
          Components: SCM
            Reporter: Stephen O'Donnell


If someone triggers decommission / maintenance, there is potentially a 5 minute 
lag between the decommission process starting and RM noticing that containers 
need replication, because RM runs on a 5 minute interval. Similarly, if a node 
goes dead, it has already been gone for 10 minutes, and it will take up to 
another 5 minutes for RM to notice and process the containers.

It would be good to notify the RM thread to wake it up when these events happen 
to reduce the time it takes to start to repair the problem.

One thing to keep in mind for any solution is that RM operates by:

1. Getting a list of all containers.
2. Processing the list.
3. Sleeping for 5 minutes.

If a node goes dead during step 2 and we notify the thread, it will already be 
running, so the notify will not do anything. Some of the containers from the 
node in question may have been processed already, or they may still be waiting 
to be processed - we don't really know. Perhaps this is OK, rather than 
complicating the solution, as in general fixing decommission or 
under-replication will take a long time.
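
As a rough illustration of the wake-up idea, here is a minimal sketch of an 
interruptible interval sleep. The class and method names (WakeableInterval, 
requestWake, sleepUntilNextRun) are hypothetical and are not taken from the 
Ozone code base. Note one assumption that goes beyond a bare notify: the 
request is recorded in a flag, so a request made while the thread is busy in 
step 2 would cause the next sleep to return immediately. Whether that extra 
run is wanted is exactly the open question above.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper, not part of Ozone: replaces a fixed sleep with one
// that can be cut short by a node event.
final class WakeableInterval {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition wake = lock.newCondition();
  // Remembers a request that arrives while nobody is sleeping.
  private boolean wakeRequested = false;

  // Called by a (hypothetical) dead-node or decommission handler.
  void requestWake() {
    lock.lock();
    try {
      wakeRequested = true;
      wake.signalAll();
    } finally {
      lock.unlock();
    }
  }

  // Sleeps for up to intervalMs. Returns true if a wake-up was requested,
  // false if the full interval elapsed or the thread was interrupted.
  boolean sleepUntilNextRun(long intervalMs) {
    long remainingNanos = TimeUnit.MILLISECONDS.toNanos(intervalMs);
    lock.lock();
    try {
      // Loop to guard against spurious wake-ups.
      while (!wakeRequested && remainingNanos > 0) {
        remainingNanos = wake.awaitNanos(remainingNanos);
      }
      boolean wokenEarly = wakeRequested;
      wakeRequested = false; // several requests collapse into one run
      return wokenEarly;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      lock.unlock();
    }
  }
}
```

In this sketch the RM loop would call sleepUntilNextRun(...) in place of its 
fixed 5 minute sleep, and the node event handlers would call requestWake().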

It is also possible that several nodes go dead in quick succession, or several 
nodes go out of service quickly, resulting in several notify calls occurring. 
We don't want to wake up the thread too frequently if this happens, as it will 
result in a new replication queue getting created over and over. Perhaps if the 
queue is not empty, then there is replication work to do, and we should not run 
again.
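
The "do not run again while the queue is not empty" idea could look roughly 
like the sketch below. WakeGuard and its constructor arguments are 
hypothetical names used for illustration only; how RM actually exposes its 
queue size is not stated here, so it is passed in as a supplier.

```java
import java.util.function.IntSupplier;

// Hypothetical guard, not part of Ozone: forwards a node event to the RM
// thread only when no replication work is already queued.
final class WakeGuard {
  private final IntSupplier queuedWork; // assumed view of the queue size
  private final Runnable wakeAction;    // assumed hook that wakes RM

  WakeGuard(IntSupplier queuedWork, Runnable wakeAction) {
    this.queuedWork = queuedWork;
    this.wakeAction = wakeAction;
  }

  // Returns true if the RM thread was woken, false if the event was
  // dropped because there is still work in the queue.
  boolean onNodeEvent() {
    // Check-then-act is not atomic. A stale read here only costs one
    // extra or one skipped early run, which the periodic run covers.
    if (queuedWork.getAsInt() > 0) {
      return false;
    }
    wakeAction.run();
    return true;
  }
}
```

With a guard like this, a burst of dead-node events while the queue is still 
draining would not trigger repeated runs; the regular interval remains as the 
fallback.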

Finally, we might want to consider notifying on a node coming back into 
service, as that could cause over-replication. However, over-replication is not 
as big a problem as under-replication if it is not addressed quickly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)