sodonnel commented on code in PR #3963:
URL: https://github.com/apache/ozone/pull/3963#discussion_r1029478291
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java:
##########
@@ -325,10 +329,22 @@ public synchronized void processAll() {
containerManager.getContainers();
ReplicationManagerReport report = new ReplicationManagerReport();
ReplicationQueue newRepQueue = new ReplicationQueue();
+ Map<ContainerID, MoveDataNodePair> pendingMoves =
+ moveManaer.getPendingMove();
Review Comment:
A fundamental design problem with the old RM was that it placed many hours' or
even days' worth of commands on the DNs without any way to see what progress
they were making.
It seems the balancer has made this same mistake, in that it does not limit
what is queued on a DN. Ideally, the balancer should schedule no more work than
can be done in a heartbeat or two. If we have under-utilized nodes, they are
going to be the target for replicate commands, so the limit on the number
queued per heartbeat should be quite small. The delete commands might be more
scattered, but we should limit them too, just to a higher number.
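Something like the following sketch is what I have in mind for the throttling. All class and field names here are illustrative, not existing Ozone APIs, and the actual limits would be configurable:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: cap the number of commands queued per datanode so
// the balancer never schedules more than a heartbeat or two of work.
public class DatanodeCommandThrottle {
  private final int replicateLimit;  // small: targets are under-utilized nodes
  private final int deleteLimit;     // higher: deletes are cheaper and scattered
  private final Map<String, Integer> pendingReplicates = new HashMap<>();
  private final Map<String, Integer> pendingDeletes = new HashMap<>();

  public DatanodeCommandThrottle(int replicateLimit, int deleteLimit) {
    this.replicateLimit = replicateLimit;
    this.deleteLimit = deleteLimit;
  }

  /** Returns true if a replicate command may still be queued on this DN. */
  public boolean tryQueueReplicate(String dnUuid) {
    int count = pendingReplicates.getOrDefault(dnUuid, 0);
    if (count >= replicateLimit) {
      return false;  // DN already has a heartbeat's worth of copy work
    }
    pendingReplicates.put(dnUuid, count + 1);
    return true;
  }

  /** Returns true if a delete command may still be queued on this DN. */
  public boolean tryQueueDelete(String dnUuid) {
    int count = pendingDeletes.getOrDefault(dnUuid, 0);
    if (count >= deleteLimit) {
      return false;
    }
    pendingDeletes.put(dnUuid, count + 1);
    return true;
  }
}
```

The counters would be decremented (or rebuilt) from the pending-op info reported in each heartbeat, so a slow DN naturally stops receiving new work.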
If we avoid deep queues on the DNs, then we also avoid the problem with a
leader switch on SCM.
At the moment ReplicationManager does not replicate its pending ops or
commands to each SCM. The goal is to keep the queue on the DN small, so that
even if a leader switch did happen, there is not much extra work queued;
otherwise RM can fall into this same problem of scheduling multiple deletes.
Better still, we could also make the DN reject any commands that don't carry
the current SCM leader term. For example, any replicate, delete container or
reconstruct container command can be discarded if the SCM term has increased,
and the replication manager will simply re-create the commands on its next
pass.
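On the DN side this term check could be as simple as the sketch below. The class and method names are hypothetical, and I'm assuming the heartbeat response already carries (or could carry) the leader term:

```java
// Hypothetical sketch of a datanode-side term check: commands are stamped
// with the SCM leader term that issued them, and the DN discards any command
// from a lower term than the highest it has seen. The new leader's
// ReplicationManager re-creates any dropped work on its next pass.
public class TermCheckedCommandQueue {
  private long highestSeenScmTerm = 0;

  /** Record the leader term reported in the latest SCM heartbeat response. */
  public void updateTerm(long term) {
    if (term > highestSeenScmTerm) {
      highestSeenScmTerm = term;
    }
  }

  /**
   * Returns true if the command should be executed; false if it was issued
   * by a stale leader and should be silently dropped.
   */
  public boolean shouldExecute(long commandTerm) {
    return commandTerm >= highestSeenScmTerm;
  }
}
```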
There isn't really any reason the Balancer could not do the same, i.e.
schedule work only on DNs with the capacity to process it, and then drop all
commands on a leader switch. It would make the balancer a lot simpler
overall.
Then, we could have the over-replication handling try to take space into
account and remove the replicas from the most over-used nodes, making it do
the correct thing more of the time. In the event of a failover, some
over-replicated containers caused by the balancer might have the wrong replica
removed, but if the free space is so finely balanced between the nodes that
the wrong one gets removed, it doesn't really matter.
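To make the space-aware part concrete, the selection could look like this sketch. The record and its `freeSpaceRatio` field are illustrative stand-ins for whatever node usage info RM has available:

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: when a container is over-replicated, prefer to delete
// the replica hosted on the node with the least free space, so that
// over-replication handling also nudges the cluster toward balance.
public class SpaceAwareReplicaSelector {

  /** A replica location with the hosting node's free-space ratio (0..1). */
  public record ReplicaLocation(String dnUuid, double freeSpaceRatio) { }

  /** Pick the replica whose node is most full (smallest free-space ratio). */
  public static ReplicaLocation selectForDeletion(
      List<ReplicaLocation> replicas) {
    return replicas.stream()
        .min(Comparator.comparingDouble(ReplicaLocation::freeSpaceRatio))
        .orElseThrow();
  }
}
```

Even if a failover causes this to occasionally pick a different replica than the balancer intended, the outcome is still a valid replica count on a reasonably chosen node.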
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]