sodonnel commented on PR #3384: URL: https://github.com/apache/ozone/pull/3384#issuecomment-1125885960
For now, I would like to leave the current LegacyReplicationManager class as it is and not focus on Balancer for EC. I feel that a lot of the logic for pending replications is overly complex, and if we just lift it out of LegacyReplicationManager and into a new class, it does not improve things - we have really just renamed the code.

The legacy replication manager internally keeps a list of all pending replications and deletes. Each time a container is checked, it checks this list and removes any replications that have completed or expired. Then it gets the list of remaining pending operations to help decide whether the container is healthy or not.

Rather than the ReplicationManager removing the completed and expired replications, we could have a standalone PendingContainerOps monitor that works as follows:

1. Replication Manager adds pending replications and deletes to it.
2. Replication Manager queries it for anything pending for the current container and gets a list of PendingActions back.
3. The PendingReplicationMonitor has its own internal thread that checks for expired replications and removes them.
4. Completed replications and deletes are removed in ContainerManagerImpl, which has updateContainerReplica and removeContainerReplica triggered via the container reports (ICR and FCR) from the datanodes as they are replicated.

This way, the ReplicationManager does not need to worry about expiring replications or removing completed entries. We also get a more up-to-date view of the system, as the ICRs / FCRs will keep the pending table up-to-date in real time, rather than having to wait for the container to be re-checked inside replication manager.

We can have a fairly simple "ContainerReplicaPendingOps" class that is basically standalone and inject it into ReplicationManager and ContainerManagerImpl. This would allow for removing some complexity from RM and let the expiry and completion be tested in an isolated way.
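To make the proposal a bit more concrete, here is a minimal sketch of what such a standalone class might look like. Only the class name ContainerReplicaPendingOps comes from the description above; the method names (`scheduleOp`, `getPendingOps`, `completeOp`, `removeExpired`), the `PendingOp` and `OpType` types, and the use of a plain `long` container ID and `String` datanode ID are all assumptions made for illustration, not the actual Ozone API. Time is passed in explicitly so that expiry can be tested without a real clock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a standalone table of pending replica operations.
final class ContainerReplicaPendingOps {
  enum OpType { ADD, DELETE }

  // One pending replication (ADD) or delete (DELETE) aimed at a datanode.
  static final class PendingOp {
    final OpType type;
    final String datanode;          // assumed: datanode identified by a string
    final long scheduledAtMillis;

    PendingOp(OpType type, String datanode, long scheduledAtMillis) {
      this.type = type;
      this.datanode = datanode;
      this.scheduledAtMillis = scheduledAtMillis;
    }
  }

  private final Map<Long, List<PendingOp>> pending = new ConcurrentHashMap<>();
  private final long timeoutMillis;

  ContainerReplicaPendingOps(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  // Step 1: the replication manager records a pending replication or delete.
  void scheduleOp(long containerId, OpType type, String datanode, long nowMillis) {
    pending.compute(containerId, (id, ops) -> {
      List<PendingOp> list = (ops == null) ? new ArrayList<>() : ops;
      list.add(new PendingOp(type, datanode, nowMillis));
      return list;
    });
  }

  // Step 2: the replication manager asks what is still pending for a container.
  List<PendingOp> getPendingOps(long containerId) {
    List<PendingOp> copy = new ArrayList<>();
    pending.computeIfPresent(containerId, (id, ops) -> {
      copy.addAll(ops);
      return ops;
    });
    return copy;
  }

  // Step 3: called periodically (for example from an internal thread) to drop
  // operations older than the timeout. Returns the number of entries removed.
  int removeExpired(long nowMillis) {
    int[] removed = {0};
    for (Long id : pending.keySet()) {
      pending.computeIfPresent(id, (k, ops) -> {
        int before = ops.size();
        ops.removeIf(op -> nowMillis - op.scheduledAtMillis > timeoutMillis);
        removed[0] += before - ops.size();
        return ops.isEmpty() ? null : ops;   // returning null drops the key
      });
    }
    return removed[0];
  }

  // Step 4: the container manager reports a completed add or delete, as it
  // would when a container report arrives. Returns true if an entry matched.
  boolean completeOp(long containerId, OpType type, String datanode) {
    boolean[] removed = {false};
    pending.computeIfPresent(containerId, (id, ops) -> {
      removed[0] = ops.removeIf(
          op -> op.type == type && op.datanode.equals(datanode));
      return ops.isEmpty() ? null : ops;
    });
    return removed[0];
  }
}
```

In this sketch the replication manager would only ever call `scheduleOp` and `getPendingOps`, while `completeOp` would be driven from the replica update and remove paths and `removeExpired` from a scheduled task, so expiry and completion can each be unit tested against the class on its own.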
I generally agree with your suggestions on the classes / functions we need to have around ReplicationManager. I have already started working on the health check interface in [HDDS-6697](https://issues.apache.org/jira/browse/HDDS-6697), but I got sidetracked into the EcContainerReplicaCounts class, which I realised I needed before going much further. I will have a go at creating the outline of what I described above and see how it looks.
