sodonnel commented on pull request #2395:
URL: https://github.com/apache/ozone/pull/2395#issuecomment-876600258


   > This is just a limit on the number of containers to be processed; note 
that ReplicationManager counts each container as processed no matter whether it 
is under replicated, over replicated or good.
   
   I'm not sure how well this will work in practice. Let's say we have 1M 
containers and the default limit here is 10,000. The default RM sleep interval 
is 5 minutes, so it will take 100 iterations, i.e. 500 minutes, to process all 
the containers. If we lose one node on the cluster, then some percentage of 
these containers will be under-replicated. Let's say we have a 100 node 
cluster, so:
   
   1M containers * 3 replicas = 3M / 100 DNs = 30k per DN.
   
   In theory we could process them all in 3 iterations, but this change will 
not do that.
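
   Just to make the back-of-envelope numbers above concrete (these are the 
values from this comment, not defaults read from the Ozone code, and the class 
is purely illustrative):

```java
// Illustrative arithmetic only: how long a full RM pass takes when each
// iteration is capped at a fixed batch of containers.
class RmBatchMath {

  // Number of RM iterations needed to touch every container once.
  static long iterations(long containers, long batchLimit) {
    return (containers + batchLimit - 1) / batchLimit; // ceiling division
  }

  // Wall-clock minutes for a full pass, given the RM sleep interval.
  static long minutesToFullPass(long containers, long batchLimit,
      long sleepMinutes) {
    return iterations(containers, batchLimit) * sleepMinutes;
  }

  public static void main(String[] args) {
    // 1M containers / 10k batch limit * 5 minute interval = 500 minutes.
    System.out.println(minutesToFullPass(1_000_000, 10_000, 5));
    // 30k replicas per DN after a node loss: in theory only 3 iterations.
    System.out.println(iterations(30_000, 10_000));
  }
}
```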
   
   We also have to consider decommission. In the example above, it might take 
up to 500 minutes for all the containers on the decommissioning node to be 
processed. Or worse, for maintenance mode there may be only a handful of 
containers that need to be replicated before the node can go to maintenance, 
but that could be delayed by up to 500 minutes due to the batch size.
   
   RM has problems for sure - I don't like the way it calculates all the 
replication work for all containers and then sends all of it down to the DNs 
at once. The DNs don't really give feedback; the attempts time out on the RM 
(by default after 10 minutes) and it schedules more. I feel we need some way 
to hold back the work in SCM and feed it to the DNs as they have capacity to 
accept it. That might mean the DNs reporting their pending replication tasks 
via their heartbeat, and RM not scheduling more commands until a DN has free 
capacity.
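
   Roughly what I have in mind is something like this sketch. None of these 
class or field names exist in Ozone today; the per-DN budget and the idea of 
dispatching on heartbeat are assumptions about how it could work, not a 
proposal for the exact API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: SCM keeps the replication backlog and only hands a
// DN more work when its heartbeat shows spare capacity, instead of pushing
// everything down at once and waiting for timeouts.
class BackpressureSketch {

  static final int MAX_INFLIGHT_PER_DN = 20; // assumed per-DN command budget

  static class DatanodeState {
    int pendingReplications; // reported back via heartbeat
  }

  // Container IDs still waiting to be replicated, held back in SCM.
  final Queue<String> backlog = new ArrayDeque<>();

  // Called when a heartbeat arrives: dispatch only as much work as the DN
  // can take within its budget.
  List<String> onHeartbeat(DatanodeState dn) {
    List<String> dispatched = new ArrayList<>();
    while (!backlog.isEmpty()
        && dn.pendingReplications + dispatched.size() < MAX_INFLIGHT_PER_DN) {
      dispatched.add(backlog.poll());
    }
    dn.pendingReplications += dispatched.size();
    return dispatched;
  }
}
```

   A nice side effect of pulling work on heartbeat like this is that faster 
DNs drain their queue sooner and so naturally receive more work.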
   
   Some DNs on the cluster may work faster than others for some reason (newer 
hardware, less load, etc). Ideally they would clear their replication work 
faster and hence be able to receive more, but as things stand the work is 
handed out randomly, so we cannot take advantage of that.
   
   Another thing we might want to consider is triggering RM from a dead node 
or decommission / maintenance event, rather than waiting for the thread sleep 
interval.
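
   That could be as simple as letting the RM loop wait on a monitor instead of 
a plain sleep, so event handlers can wake it early. Again, the names here are 
illustrative only:

```java
// Sketch of an event-triggered wakeup: the RM loop waits up to its normal
// interval, but a dead-node / decommission / maintenance handler can wake
// it immediately by calling notifyEvent().
class TriggeredLoop {

  private final Object wakeLock = new Object();
  private boolean triggered = false;

  // Called from the dead-node / decommission / maintenance handlers.
  void notifyEvent() {
    synchronized (wakeLock) {
      triggered = true;
      wakeLock.notifyAll();
    }
  }

  // One iteration's "sleep": returns true if woken early by an event,
  // false if the full interval elapsed with nothing happening.
  boolean awaitNextRun(long intervalMillis) throws InterruptedException {
    synchronized (wakeLock) {
      if (!triggered) {
        wakeLock.wait(intervalMillis);
      }
      boolean wokenByEvent = triggered;
      triggered = false;
      return wokenByEvent;
    }
  }
}
```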
   
   I believe we need to have a good think about how RM works, and what we could 
do to improve it generally.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


