sodonnel commented on PR #3963: URL: https://github.com/apache/ozone/pull/3963#issuecomment-1326694265
The more I think about this, the more I believe that the Balancer / Move Manager and the Replication Manager need to be aligned in how they work. Replication Manager does not replicate its pendingOps across the standby SCM instances, so in the event of a failover, the "inflight" operations are not known to the new leader SCM. If RM is not replicating its pending moves, then why do we need the balancer to replicate its commands too? Failovers should be rare, and the worst case is that some balancing work does not complete as intended, but it can easily be rescheduled.

When the SCM leader switches, the leader term is updated, and the datanodes learn the new term on their next heartbeat to the new leader SCM. All commands scheduled on a datanode carry the SCM leader term. So all we need to do is tell the DN to drop any commands that came from an old SCM leader, and wait for a short period after the leader switch before the RM or Balancer runs: perhaps a few heartbeats, or until all DNs report an empty queue (this may be more tricky). Then the processes can start up and schedule new work.

Additionally, if we changed the over-replication logic so that it prefers to delete a replica from a node with less free space than the others, then replication becomes somewhat self-balancing, and perhaps the delete part of the balancer isn't needed. As it stands, we cannot make that over-replication optimization in SCM, as the delete order is deterministic. With EC, we could also make the pipeline policy "space aware" so that it prefers to allocate new containers on nodes with more free space, which would make the writes self-balancing too. Due to the long-lived pipelines in Ratis, this is more difficult, but if we got to a place where pipelines are destroyed after a few hours of uptime, then the Ratis pipeline policy could pick lesser-used nodes too.
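
The term-based handling of stale commands described above could look roughly like the sketch below. It is illustrative Python rather than Ozone code: `Command`, `DatanodeQueue`, `on_heartbeat_response` and `enqueue` are hypothetical names, and the only assumptions taken from the comment are that each command carries the SCM leader term and that the datanode learns the current term via heartbeat.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Command:
    """A queued datanode command stamped with the issuing SCM leader term."""
    name: str
    term: int


@dataclass
class DatanodeQueue:
    """Hypothetical datanode-side queue that remembers the newest term seen."""
    current_term: int = 0
    pending: List[Command] = field(default_factory=list)

    def on_heartbeat_response(self, leader_term: int) -> List[Command]:
        """Record a newer leader term and drop commands from older terms.

        Returns the commands that were dropped.
        """
        if leader_term <= self.current_term:
            return []
        self.current_term = leader_term
        dropped = [c for c in self.pending if c.term < leader_term]
        self.pending = [c for c in self.pending if c.term >= leader_term]
        return dropped

    def enqueue(self, cmd: Command) -> bool:
        """Reject commands issued by a leader older than the known term."""
        if cmd.term < self.current_term:
            return False
        self.pending.append(cmd)
        return True
```

With something like this on the datanode, the new leader would only need to wait until every DN has heartbeated at least once (or a few times) before RM and the Balancer start scheduling, since any work from the old term has been discarded by then.
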
In an ideal world, the system will self balance and the balancer would only be needed after adding new nodes to quickly move data.
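
As a rough illustration of the space-aware over-replication idea, the sketch below picks excess replicas from the nodes with the least free space. It is illustrative Python, not the SCM implementation: `Replica` and `pick_replicas_to_delete` are hypothetical names, and it deliberately ignores placement-policy constraints (such as rack awareness) and whatever currently requires the fixed delete order.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Replica:
    """One container replica and the free space on the node holding it."""
    datanode: str
    free_bytes: int


def pick_replicas_to_delete(replicas: List[Replica], expected: int) -> List[Replica]:
    """Return the excess replicas, preferring nodes with the least free space."""
    excess = len(replicas) - expected
    if excess <= 0:
        return []
    # Least free space first; the datanode name breaks ties so that the
    # result is stable for a given input.
    ordered = sorted(replicas, key=lambda r: (r.free_bytes, r.datanode))
    return ordered[:excess]
```

For example, with replicas on nodes that have 500, 100 and 300 bytes free plus a fourth with 100, and an expected count of 3, the replica on one of the fullest nodes is chosen for deletion, which nudges the cluster toward balance without a separate balancer move.
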
