sodonnel commented on PR #3963:
URL: https://github.com/apache/ozone/pull/3963#issuecomment-1326694265

   The more I think about this, the more I believe that the Balancer / Move 
Manager and the Replication Manager need to be aligned in how they work.
   
   Replication Manager does not replicate its pendingOps to the standby SCM 
instances, so after a failover the "inflight" operations are not known to the 
new leader SCM. If RM is not replicating its pending operations, then why do 
we need the balancer to replicate its commands? Failovers should be rare, and 
the worst case is that some balancing work does not complete as intended, but 
it can easily be rescheduled.
   
   When the SCM leader switches, the leader term gets updated, and each 
datanode learns the new term through its next heartbeat to the new leader SCM. 
All commands scheduled on a datanode carry the SCM leader term within them.
   
   So all we need to do is tell the DN to drop any commands that are from an 
old SCM leader, and wait for a short period of time after the leader switch 
before the RM or Balancer runs: perhaps a few heartbeats, or until all DNs 
report an empty queue (this may be more tricky). Then the processes can start 
up and schedule new work.
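   As a rough illustration of the two pieces above, here is a minimal Java 
sketch. Everything in it is hypothetical: `TermAwareCommandQueue`, `Command`, 
`FailoverGate` and their methods are invented names, not existing Ozone classes, 
and the real datanode command queue and SCM heartbeat handling are more 
involved. It only shows the idea of dropping commands tagged with an older 
leader term, and holding back RM / Balancer until every DN has heartbeated a 
few times to the new leader.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch; these are not Ozone classes.

/** Datanode-side queue that discards commands issued by an older SCM leader. */
final class TermAwareCommandQueue {
  static final class Command {
    final long term;           // SCM leader term that issued the command
    final String description;

    Command(long term, String description) {
      this.term = term;
      this.description = description;
    }
  }

  private final Queue<Command> queue = new ArrayDeque<>();
  private long currentTerm;    // highest leader term seen so far

  /** Queue a command, rejecting it if an older leader issued it. */
  synchronized boolean offer(Command cmd) {
    if (cmd.term < currentTerm) {
      return false;
    }
    queue.add(cmd);
    return true;
  }

  /** Called when a heartbeat reports a leader term; returns commands dropped. */
  synchronized int onLeaderTerm(long term) {
    if (term <= currentTerm) {
      return 0;
    }
    currentTerm = term;
    int before = queue.size();
    queue.removeIf(c -> c.term < term);
    return before - queue.size();
  }

  synchronized int size() {
    return queue.size();
  }
}

/** SCM-side gate: stays closed until every known DN has heartbeated N times. */
final class FailoverGate {
  private final int required;
  private final Map<String, Integer> heartbeats = new HashMap<>();

  FailoverGate(Collection<String> datanodes, int required) {
    this.required = required;
    for (String dn : datanodes) {
      heartbeats.put(dn, 0);
    }
  }

  synchronized void onHeartbeat(String datanode) {
    // Heartbeats from untracked datanodes are ignored.
    heartbeats.computeIfPresent(datanode, (dn, count) -> count + 1);
  }

  synchronized boolean isOpen() {
    return heartbeats.values().stream().allMatch(c -> c >= required);
  }
}
```

   With something like this, the new leader would only start RM and the 
Balancer once `isOpen()` returns true, by which point each DN should have seen 
the new term and purged the old leader's commands.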
   
   Additionally, if we changed the over-replication logic so that it prefers 
to delete a replica from the node with less free space than the others, then 
replication becomes somewhat self-balancing, and perhaps the delete part of the 
balancer isn't needed. As it stands, we cannot make that over-replication 
optimization in SCM, as the delete order is deterministic.
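   To make the proposed preference concrete, here is a small Java sketch. It 
describes the suggested change, not what SCM does today, and the names 
(`OverReplicationChooser`, `Replica`, `chooseReplicaToDelete`) are made up for 
illustration. It picks the replica on the node with the least free space, and 
breaks ties by datanode name so the result is still stable for equal inputs.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; not the existing SCM over-replication handler.
final class OverReplicationChooser {
  static final class Replica {
    final String datanode;
    final long freeBytes;   // free space reported by the hosting datanode

    Replica(String datanode, long freeBytes) {
      this.datanode = datanode;
      this.freeBytes = freeBytes;
    }
  }

  /** Pick the excess replica to delete: the one on the fullest node. */
  static Optional<Replica> chooseReplicaToDelete(List<Replica> candidates) {
    return candidates.stream()
        .min(Comparator.comparingLong((Replica r) -> r.freeBytes)
            .thenComparing(r -> r.datanode));
  }
}
```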
   
   With EC, we could also make the pipeline policy "space aware" so it prefers 
to allocate new containers on nodes with more free space, also making the 
writes self-balancing. Due to the long-lived pipelines in Ratis, this is more 
difficult, but if we got to a place where pipelines are destroyed after a few 
hours of uptime, then the Ratis pipeline policy could pick lesser-used nodes 
too. In an ideal world, the system would self-balance and the balancer would 
only be needed to quickly move data after adding new nodes.
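   A "space aware" choice could look roughly like the Java sketch below. This 
is an assumption-laden illustration, not the existing placement policy API: 
`SpaceAwarePlacement`, `Node` and `chooseNodes` are hypothetical names, and a 
real policy would also need to honour rack awareness and exclusion lists, and 
would probably add some randomness rather than always taking the emptiest 
nodes.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch; not an existing Ozone placement policy.
final class SpaceAwarePlacement {
  static final class Node {
    final String name;
    final long freeBytes;   // free space reported by the datanode

    Node(String name, long freeBytes) {
      this.name = name;
      this.freeBytes = freeBytes;
    }
  }

  /** Choose the required number of nodes, preferring those with most free space. */
  static List<String> chooseNodes(List<Node> healthy, int required) {
    if (healthy.size() < required) {
      throw new IllegalArgumentException(
          "need " + required + " nodes but only " + healthy.size() + " available");
    }
    return healthy.stream()
        .sorted(Comparator.comparingLong((Node n) -> n.freeBytes).reversed())
        .limit(required)
        .map(n -> n.name)
        .collect(Collectors.toList());
  }
}
```

   For an EC container, `required` would be the total of data plus parity 
replicas, for example five nodes for a 3+2 layout.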


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

