[
https://issues.apache.org/jira/browse/HDDS-8660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-8660:
---------------------------------
Labels: pull-request-available (was: )
> ReplicationManager: Notify when dead nodes or nodes go out of service
> ---------------------------------------------------------------------
>
> Key: HDDS-8660
> URL: https://issues.apache.org/jira/browse/HDDS-8660
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: SCM
> Reporter: Stephen O'Donnell
> Assignee: Peter Lee
> Priority: Major
> Labels: pull-request-available
>
> If someone triggers decommission / maintenance, there is potentially a 5
> minute lag between the decommission process starting and RM noticing that
> containers need replication, due to RM running on a 5 minute interval.
> Similarly, if a node goes dead, it has already been gone for 10 minutes, and
> it will take up to another 5 minutes for RM to notice and process the
> containers.
> It would be good to notify the RM thread to wake it up when these events
> happen to reduce the time it takes to start to repair the problem.
> One thing to keep in mind for any solution is that RM operates by:
> 1. Getting a list of all containers.
> 2. Processing the list.
> 3. Sleeping for 5 minutes.
> If a node goes dead during step 2 and we notify the thread, it will
> already be running, so the notify will not do anything. It may be that some
> of the containers from the node in question have been processed already, or
> they may still need to be processed - we don't really know. Perhaps this is
> OK, rather than complicating the solution, as in general fixing decommission
> or under-replication will take a long time.
> It is also possible that several nodes go dead in quick succession, or
> several nodes go out of service quickly, resulting in several notify calls
> occurring. We don't want to wake up the thread too frequently if this
> happens, as it will result in a new replication queue getting created over
> and over. Perhaps if the queue is not empty, then there is replication work
> to do, and we should not run again.
> Finally, we might want to consider notifying on a node coming back into
> service, as that could cause over-replication. However, over-replication is
> not as big of a problem as under-replication if it is not addressed quickly.
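
The wake-up behaviour described above could be sketched roughly as follows.
This is a minimal illustration only, not the actual Ozone ReplicationManager
code: the class name ReplicationWakeupSketch, the methods notifyNodeEvent and
awaitNextRun, and the pendingQueueSize supplier are all hypothetical names
introduced here. It assumes the RM thread sleeps on a monitor between passes,
that repeated notifications should collapse into one wake-up, and that a
notification is ignored while the replication queue still holds work.

```java
import java.util.function.IntSupplier;

/** Hypothetical sketch; not the real Ozone ReplicationManager. */
class ReplicationWakeupSketch {
  private final Object lock = new Object();
  // Assumed read-only view of how many entries remain in the replication queue.
  private final IntSupplier pendingQueueSize;
  private boolean wakeRequested = false;

  ReplicationWakeupSketch(IntSupplier pendingQueueSize) {
    this.pendingQueueSize = pendingQueueSize;
  }

  /**
   * Called when a node goes dead or out of service.
   * Returns true if a wake-up was recorded, false if it was skipped because
   * the queue still has work or a wake-up is already pending.
   */
  boolean notifyNodeEvent() {
    synchronized (lock) {
      if (pendingQueueSize.getAsInt() > 0 || wakeRequested) {
        return false;
      }
      wakeRequested = true;
      lock.notifyAll();
      return true;
    }
  }

  /**
   * Step 3 of the loop: sleep for up to intervalMs, returning early if a
   * wake-up was requested. Returns true if woken by an event, false if the
   * interval elapsed or the thread was interrupted.
   */
  boolean awaitNextRun(long intervalMs) {
    long deadline = System.nanoTime() + intervalMs * 1_000_000L;
    synchronized (lock) {
      while (!wakeRequested) {
        long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
        if (remainingMs <= 0) {
          return false;
        }
        try {
          lock.wait(remainingMs);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return false;
        }
      }
      // Consume the request so several events trigger only one extra pass.
      wakeRequested = false;
      return true;
    }
  }
}
```

In this sketch the flag persists until the next sleep, so an event arriving
while the list is being processed (step 2) causes the following pass to start
immediately rather than being lost. That is one possible choice; dropping the
notification in that case, as the description suggests may be acceptable,
would be simpler still.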