peterxcli opened a new pull request, #7997:
URL: https://github.com/apache/ozone/pull/7997

   ## What changes were proposed in this pull request?
   
   If someone triggers decommission / maintenance, there is potentially a 5 
minute lag between the decommission process starting and the Replication 
Manager (RM) noticing that containers need replication, because RM runs on a 
5 minute interval. Similarly, if a node goes dead, it has already been gone 
for 10 minutes, and it will take up to another 5 minutes for RM to notice and 
process the containers.
   
   It would be good to notify the RM thread to wake it up when these events 
happen to reduce the time it takes to start to repair the problem.
   
   One thing to keep in mind for any solution is that RM operates by:
   
   1. Getting a list of all containers.
   2. Processing the list.
   3. Sleeping for 5 minutes.
   
   If a node goes dead during step 2 and we notify the thread, it will 
already be running, so the notify will not do anything. It may be that some of 
the containers from the node in question have been processed already, or they 
may still need to be processed - we don't really know. Perhaps this is OK, 
rather than complicating the solution, as in general fixing decommission or 
under-replication will take a long time.
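   
   The loop and the "notify does nothing while running" behaviour described 
above can be sketched roughly as follows. This is only an illustration under 
assumptions: `WakeableMonitor`, `notifyStatusChanged`, `processAllContainers` 
and the other names are hypothetical and are not the classes or methods 
touched by this patch.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the RM monitor thread; not the actual Ozone class.
class WakeableMonitor implements Runnable {
  private final long intervalMs;
  private final AtomicInteger runs = new AtomicInteger();
  private boolean sleeping = false; // guarded by "this"
  private boolean running = true;   // guarded by "this"

  WakeableMonitor(long intervalMs) {
    this.intervalMs = intervalMs;
  }

  @Override
  public void run() {
    while (true) {
      processAllContainers(); // steps 1 and 2: list and process containers
      synchronized (this) {
        if (!running) {
          return;
        }
        sleeping = true;
        try {
          // Step 3: sleep for the interval, unless notified earlier.
          // A spurious wakeup would only cause one extra, harmless run.
          wait(intervalMs);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        } finally {
          sleeping = false;
        }
        if (!running) {
          return;
        }
      }
    }
  }

  // Placeholder for the real work; here it only counts iterations.
  private void processAllContainers() {
    runs.incrementAndGet();
  }

  // Called on decommission / maintenance / dead-node events.
  // Returns false when the thread is mid-run, so the event has no effect.
  synchronized boolean notifyStatusChanged() {
    if (!sleeping) {
      return false;
    }
    sleeping = false;
    notify();
    return true;
  }

  synchronized boolean isSleeping() {
    return sleeping;
  }

  synchronized void stop() {
    running = false;
    notify();
  }

  int runCount() {
    return runs.get();
  }
}
```

   In this sketch a notification that arrives during step 2 is simply 
dropped, which matches the "perhaps this is OK" trade-off above: the affected 
containers are picked up either by the current pass or by the next scheduled 
one.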
   
   It is also possible that several nodes go dead in quick succession, or 
several nodes go out of service quickly, resulting in several notify calls 
occurring. We don't want to wake up the thread too frequently if this happens, 
as it will result in a new replication queue getting created over and over. 
Perhaps if the queue is not empty, then there is replication work to do, and we 
should not run again.
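   
   One possible way to express the "do not run again while the queue is not 
empty" idea is a small guard in front of the wake-up. Again this is a sketch 
under assumptions: `CoalescingNotifier`, `requestWake` and `consumeWake` are 
hypothetical names, and containers are represented by plain strings for 
brevity.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical guard that coalesces bursts of wake-up requests.
class CoalescingNotifier {
  private final Queue<String> replicationQueue = new ArrayDeque<>();
  private boolean wakePending = false;

  // Returns true only if the monitor thread should actually be woken.
  synchronized boolean requestWake() {
    if (!replicationQueue.isEmpty()) {
      return false; // replication work is still queued; do not rebuild it
    }
    if (wakePending) {
      return false; // an earlier event already asked for a run
    }
    wakePending = true;
    return true;
  }

  // Called by the monitor thread when it starts a new pass.
  synchronized boolean consumeWake() {
    boolean was = wakePending;
    wakePending = false;
    return was;
  }

  synchronized void enqueue(String containerId) {
    replicationQueue.add(containerId);
  }

  // Returns the next queued container, or null if the queue is empty.
  synchronized String poll() {
    return replicationQueue.poll();
  }
}
```

   With a guard like this, several nodes going dead in quick succession 
would trigger at most one extra pass, and none at all while earlier 
replication work is still waiting in the queue.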
   
   Finally, we might want to consider notifying on a node coming back into 
service, as that could cause over-replication. However, over-replication is 
less of a problem than under-replication if it is not addressed quickly.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-8660
   
   ## How was this patch tested?
   
   Unit tests that exercise the replicationManager notify call.
   
   CI:
   https://github.com/peterxcli/ozone/actions/runs/13617499351


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

