[ 
https://issues.apache.org/jira/browse/HDDS-8660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-8660:
---------------------------------
    Labels: pull-request-available  (was: )

> ReplicationManager: Notify when dead nodes or nodes go out of service
> ---------------------------------------------------------------------
>
>                 Key: HDDS-8660
>                 URL: https://issues.apache.org/jira/browse/HDDS-8660
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: SCM
>            Reporter: Stephen O'Donnell
>            Assignee: Peter Lee
>            Priority: Major
>              Labels: pull-request-available
>
> If someone triggers decommission / maintenance, there is potentially a 5 
> minute lag between the decommission process starting and RM noticing that 
> containers need replication, because RM runs on a 5 minute interval. 
> Similarly, if a node goes dead, it has already been gone for 10 minutes, and 
> it will take up to another 5 minutes for RM to notice and process the 
> containers.
> It would be good to notify the RM thread to wake it up when these events 
> happen to reduce the time it takes to start to repair the problem.
> One thing to keep in mind for any solution is that RM operates by:
> 1. Getting a list of all containers.
> 2. Processing the list.
> 3. Sleeping for 5 minutes.
> If a node goes dead during step 2 and we notify the thread, it will 
> already be running, so the notify will not do anything. Some of the 
> containers from the node in question may have been processed already, or 
> they may still be waiting to be processed; we don't really know. Perhaps 
> this is OK, rather than complicating the solution, as in general fixing 
> decommission or under-replication will take a long time.
> It is also possible that several nodes go dead in quick succession, or 
> several nodes go out of service quickly, resulting in several notify calls 
> occurring. We don't want to wake up the thread too frequently if this 
> happens, as it will result in a new replication queue getting created over 
> and over. Perhaps if the queue is not empty, then there is replication work 
> to do, and we should not run again.
> Finally, we might want to consider notifying on a node coming back into 
> service, as that could cause over-replication. However, over-replication is 
> not as big a problem as under-replication if it is not addressed quickly.
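
The wake-up behaviour described above could be sketched roughly as follows.
This is only an illustration under stated assumptions, not the actual
ReplicationManager code: the class name WakeableMonitor, the method
notifyNodeEvent, and the queueEmpty / processAll hooks are all hypothetical.
The sketch sleeps on a condition with a timeout, lets a node event cut the
sleep short, drops notifies that arrive while a pass is already running, and
ignores notifies while the replication queue still has work in it.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.BooleanSupplier;

// Hypothetical sketch only; this is not the Ozone ReplicationManager code.
final class WakeableMonitor implements Runnable {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition wake = lock.newCondition();
  private final long intervalMs;
  private final BooleanSupplier queueEmpty; // assumed view of the replication queue
  private final Runnable processAll;        // stands in for steps 1 and 2
  private boolean wakeRequested;            // guarded by lock
  private boolean running = true;           // guarded by lock

  WakeableMonitor(long intervalMs, BooleanSupplier queueEmpty, Runnable processAll) {
    this.intervalMs = intervalMs;
    this.queueEmpty = queueEmpty;
    this.processAll = processAll;
  }

  // Would be called when a node goes dead or out of service.
  // Returns false when the wake-up is skipped because work is still queued.
  boolean notifyNodeEvent() {
    if (!queueEmpty.getAsBoolean()) {
      return false; // replication work is pending; do not rebuild the queue
    }
    lock.lock();
    try {
      wakeRequested = true; // several notifies in a row collapse into one wake-up
      wake.signal();
      return true;
    } finally {
      lock.unlock();
    }
  }

  void stop() {
    lock.lock();
    try {
      running = false;
      wake.signal();
    } finally {
      lock.unlock();
    }
  }

  @Override
  public void run() {
    try {
      while (true) {
        processAll.run(); // steps 1 and 2: list and process containers
        lock.lock();
        try {
          // A notify that arrived while processAll was running is dropped
          // here, matching the "notify will not do anything" case.
          wakeRequested = false;
          long nanos = TimeUnit.MILLISECONDS.toNanos(intervalMs);
          // Step 3: sleep for the interval unless woken early. The loop
          // also guards against spurious wake-ups.
          while (running && !wakeRequested && nanos > 0) {
            nanos = wake.awaitNanos(nanos);
          }
          if (!running) {
            return;
          }
        } finally {
          lock.unlock();
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve interrupt status and exit
    }
  }
}
```

In this sketch an event handler would call notifyNodeEvent() and the monitor
would run on its own thread with intervalMs set to the 5 minute interval.
Whether a notify that arrives mid-pass should instead be remembered for the
next sleep is a design choice the description leaves open.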



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
