siddhantsangwan commented on PR #3963:
URL: https://github.com/apache/ozone/pull/3963#issuecomment-1337180696
> @siddhantsangwan can you explain in more detail why the balancer needs to
replicate its commands? I think you are familiar with this part.
Balancer's configurations are replicated so that they're available in the new
SCM on failover. As for moves being replicated, I think the original intent was
to provide consistency in scenarios such as this:
The old leader scheduled a move from DN1 to DN2. The replication succeeded, and
then there's a failover. How can we ensure that the new leader deletes the
replica from DN1 and not from some other DN?
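Roughly, the Ratis approach makes this work as follows. This is a minimal
sketch; the class and method names are illustrative, not Ozone's actual API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: a move's (src, target) pair is committed through the Ratis state
// machine before any command is issued, so every SCM replica has it and a
// new leader can recover it after failover.
public class MoveScheduler {
  record Move(String srcDatanode, String targetDatanode) {}

  /** Pending moves keyed by container id, populated by applying Ratis log entries. */
  private final Map<Long, Move> pendingMoves = new ConcurrentHashMap<>();

  /** Called when the Ratis log entry for a move is applied on this SCM. */
  void applyMove(long containerId, String src, String target) {
    pendingMoves.put(containerId, new Move(src, target));
  }

  /**
   * After a failover the new leader consults the replicated map instead of
   * guessing: the pending delete must target the recorded source (DN1 in the
   * example above), never some other replica holder.
   */
  String datanodeToDeleteFrom(long containerId) {
    Move move = pendingMoves.get(containerId);
    return move == null ? null : move.srcDatanode();
  }
}
```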
Back then, the Ratis approach was selected to ensure consistency. But some
other approaches were also considered. From the move design doc:
> Approach 1.2
Maintain a list of moves, i.e. {cid, src_dn, target_dn}. Once the replica set
for container cid has both src_dn and target_dn and the container is
over-replicated, deletion can be scheduled for the replica on src_dn. It would
be better for RM to have a map called inflightMoves that tracks such move
requests.
In the worst case, the container can become under-replicated, i.e. it could end
up with 2 replicas instead of 3. Let's say container c1 is being moved from d2
to d4. RM1 sees replicas {d1, d2, d3, d4} and schedules deletion for d2.
RM2 (the new leader) also sees replicas {d1, d2, d3, d4} (the deletion has not
yet completed) and schedules deletion for d1 based on a deterministic algorithm
for handling over-replicated containers. In this case the remaining replicas
would be {d3, d4}.
In case of no failovers, the move should be successful. On a failover, the move
map must be cleared.
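As a rough sketch of Approach 1.2 (only the inflightMoves name comes from the
doc; the surrounding class and method names are illustrative):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of Approach 1.2: RM tracks balancer move requests and uses them
// when deciding which replica of an over-replicated container to delete.
class ReplicationManagerSketch {
  record Move(String src, String target) {}

  /** {cid -> (src_dn, target_dn)} for moves the balancer has requested. */
  private final Map<Long, Move> inflightMoves = new ConcurrentHashMap<>();

  void recordMove(long containerId, String src, String target) {
    inflightMoves.put(containerId, new Move(src, target));
  }

  /**
   * Picks the replica to delete for an over-replicated container. If an
   * in-flight move exists and both src and target currently hold a replica,
   * the source replica is the one to remove.
   */
  String chooseReplicaToDelete(long containerId, Set<String> replicaHolders) {
    Move move = inflightMoves.get(containerId);
    if (move != null
        && replicaHolders.contains(move.src())
        && replicaHolders.contains(move.target())) {
      return move.src();
    }
    // Otherwise fall back to the normal deterministic over-replication handling.
    return replicaHolders.iterator().next();
  }

  /** Per the doc: on a failover, the move map must be cleared. */
  void onFailover() {
    inflightMoves.clear();
  }
}
```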
I think @sodonnel is suggesting something similar here. We can avoid the
worst-case scenario if the balancer waits (like RM does) after a failover and
datanodes drop commands from the old leader (sketched below).
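For the "datanodes drop commands from the old leader" part, one common way to
achieve this is term fencing: stamp every SCM command with the leader's term
and have datanodes reject anything with a stale term. This is a generic sketch,
not necessarily how Ozone implements it:

```java
// Sketch of term fencing on the datanode side. The class and field names
// are illustrative, not Ozone's actual command-handling classes.
class CommandFilter {
  private long highestTermSeen = 0;

  /** Accept a command only if it comes from the newest leader seen so far. */
  synchronized boolean accept(long commandTerm) {
    if (commandTerm < highestTermSeen) {
      return false; // stale command from a deposed leader: drop it
    }
    highestTermSeen = commandTerm;
    return true;
  }
}
```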
> Additionally, if we changed the over-replication logic so that it prefers
to delete a replica from a node with less free space than the others, then
replication becomes somewhat self-balancing, and perhaps the delete part of the
balancer isn't needed. As it stands, we cannot make that over-replication
optimization in SCM, as the delete order is deterministic.
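For reference, that optimization could look something like the following. This
is a sketch under the assumption that SCM can see per-node free space from
heartbeats; the class name and types are illustrative:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch: when a container is over-replicated, prefer deleting the replica
// on the node with the least free space (the most used node).
class OverReplicationHandlerSketch {
  /** Free bytes per datanode, e.g. as reported in node heartbeats. */
  private final Map<String, Long> freeSpace;

  OverReplicationHandlerSketch(Map<String, Long> freeSpace) {
    this.freeSpace = freeSpace;
  }

  /** Returns the datanode whose excess replica should be deleted. */
  String chooseReplicaToDelete(List<String> replicaHolders) {
    return replicaHolders.stream()
        .min(Comparator.comparingLong(dn -> freeSpace.getOrDefault(dn, 0L)))
        .orElseThrow();
  }
}
```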
@sodonnel, from the move doc:
> There is a problem with deleting the replica from the most used datanode.
Let's consider a scenario: containers c1 and c2, each with replicas on
{d1, d2, d3}. In terms of utilisation, d1 > d2 > d3. Balancer schedules moves
c1 from d1 to d4 and c2 from d2 to d4, given a limit of 1 on the maximum moves
from a source datanode. In the Replication Manager, when the copies are
complete, the replicas of both c1 and c2 will be deleted from d1 only. The
expected behavior is for c1 to be deleted from d1 and c2 to be deleted from d2.
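Running that scenario through the over-replication sketch above makes the
problem visible: after both copies complete, each container's replica set is
{d1, d2, d3, d4}, and a "delete from the most used node" rule picks d1 both
times. (d1 being the most utilised is modelled here as having the least free
space.)

```java
import java.util.List;
import java.util.Map;

// Illustrates the scenario from the move doc: both deletions target d1,
// even though the balancer meant c2's delete for d2.
class MostUsedDeletionPitfall {
  public static void main(String[] args) {
    // Utilisation d1 > d2 > d3 > d4, so free space d1 < d2 < d3 < d4.
    Map<String, Long> free = Map.of("d1", 10L, "d2", 20L, "d3", 30L, "d4", 40L);
    OverReplicationHandlerSketch handler = new OverReplicationHandlerSketch(free);

    List<String> c1Replicas = List.of("d1", "d2", "d3", "d4"); // after d1 -> d4 copy
    List<String> c2Replicas = List.of("d1", "d2", "d3", "d4"); // after d2 -> d4 copy

    System.out.println(handler.chooseReplicaToDelete(c1Replicas)); // d1 (as intended)
    System.out.println(handler.chooseReplicaToDelete(c2Replicas)); // d1 (should be d2)
  }
}
```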