slfan1989 commented on PR #7542:
URL: https://github.com/apache/ozone/pull/7542#issuecomment-2533459041

   > Is there any plan to build on these changes to add features? In general, I 
don't like refactoring things "just to make them nicer" unless things are very 
bad already. It creates churn on the code, risks introducing bugs and makes 
backports more difficult if any change is needed on this code due to a bug, and 
then needs to be backported to an earlier version.
   > 
   > Note I haven't looked at the changes in detail. I only quickly looked to 
see what the scope of the change was.
   
   @sodonnel Thank you very much for your response! I agree with your
viewpoint to some extent. We do want to contribute a new feature to the
community, which I have named "Fast Decommission / Maintenance." Before
proceeding, however, we need to carry out some preparatory work, and I
believe this PR is one of those necessary steps.
   
   > Background
   
   Currently, both decommissioning and maintenance rely on
UnderReplication, and replication in our system is very slow. As shown in the
monitoring screenshot below, after I started decommissioning one DataNode,
there were about 38,000 Containers to be replicated. After 18 hours, only
5,768 Containers had been replicated (the pending count dropped from 38,228 to
32,460), and this process was accompanied by a significant amount of I/O.
   
   
![image](https://github.com/user-attachments/assets/c08a385e-bb6b-46b8-ad46-10c47274c820)
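
   As a rough back-of-the-envelope check on the numbers above, the sketch
below estimates the replication rate and the time the remaining Containers
would need. It assumes the rate stays constant, which is a simplification and
not something we measured; the variable names are only illustrative.

```python
# Back-of-the-envelope estimate using the figures quoted above.
# Assumption: the replication rate stays constant over time.

containers_start = 38_228   # Containers pending when decommission started
containers_now = 32_460     # Containers still pending after 18 hours
elapsed_hours = 18

replicated = containers_start - containers_now
rate_per_hour = replicated / elapsed_hours
remaining_hours = containers_now / rate_per_hour

print(f"replicated: {replicated} Containers")
print(f"rate: {rate_per_hour:.0f} Containers/hour")
print(f"estimated time remaining: {remaining_hours:.0f} hours "
      f"(~{remaining_hours / 24:.1f} days)")
```

   Under that assumption, this works out to roughly 320 Containers per hour,
so the single DataNode would need about four more days to finish.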
   
   > New Solution
   
   We have developed a V1 of the Fast Decommission feature. In testing, we
confirmed that we can decommission at least 10 machines simultaneously, each
holding 140TB of data, in about 2 days. The general approach of this solution
is as follows:
   
   1. The SCM creates a decommissioning execution plan (which Containers
should be transferred to which Target DataNode) based on the DataNode being
decommissioned and the global data storage status. However, the SCM is not
responsible for executing this plan.
   2. The SCM sends the execution plan generated in step 1 to the DataNodes 
that need to be decommissioned. The decommissioning DataNodes are then 
responsible for transferring the Containers to the corresponding Target 
DataNodes.
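
   To make the two steps above more concrete, here is a minimal sketch of
what building such a plan could look like. It is not the V1 implementation:
`TargetNode`, `build_plan`, the node names, and the greedy "largest Container
first, emptiest target first" heuristic are all hypothetical and chosen only
for illustration.

```python
from dataclasses import dataclass


@dataclass
class TargetNode:
    """A candidate Target DataNode (hypothetical model, not an Ozone class)."""
    name: str
    free_bytes: int


def build_plan(containers: dict[int, int], targets: list[TargetNode]) -> dict[int, str]:
    """Map each Container ID to the name of a Target DataNode.

    containers: Container ID -> size in bytes on the decommissioning node.
    Greedy heuristic (an assumption, not the V1 algorithm): place the largest
    Containers first, each on the target with the most free space.
    """
    plan = {}
    by_size = sorted(containers.items(), key=lambda kv: kv[1], reverse=True)
    for cid, size in by_size:
        best = max(targets, key=lambda t: t.free_bytes)
        if best.free_bytes < size:
            raise ValueError(f"no target has room for container {cid}")
        best.free_bytes -= size
        plan[cid] = best.name
    return plan


# Step 1: the SCM side computes the plan but does not execute it.
GB = 10**9
containers = {1: 5 * GB, 2: 5 * GB, 3: 3 * GB}
targets = [TargetNode("dn-a", 8 * GB), TargetNode("dn-b", 6 * GB)]
plan = build_plan(containers, targets)

# Step 2: the plan would be sent to the decommissioning DataNode, which
# performs the transfers itself; here we only print what it would do.
for cid, target in sorted(plan.items()):
    print(f"container {cid} -> {target}")
```

   The point of the split is that the SCM only decides where data goes, while
the decommissioning DataNode does the actual copying.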
   
   - Monitoring Screenshot 
   
   
![image](https://github.com/user-attachments/assets/7daa640d-78cf-4702-ac38-5ac7b968b8d6)
   
![image](https://github.com/user-attachments/assets/4a4db66d-8808-4412-b911-4b1c97128d07)
   
   This feature allows us to make better use of the system bandwidth and reduce 
system I/O wait.
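
   For a sense of scale, the sketch below converts the test figures quoted
earlier (10 machines, 140TB each, about 2 days) into an average transfer
rate. It assumes decimal terabytes, that every machine moves its full 140TB,
and that all machines transfer for the whole 2 days; these are simplifying
assumptions, not measurements.

```python
# Rough average throughput implied by the test figures quoted above.
# Assumptions: decimal terabytes, every machine moves its full 140 TB,
# and all 10 machines transfer for the whole 2 days.
machines = 10
tb_per_machine = 140
days = 2

total_bytes = machines * tb_per_machine * 10**12
seconds = days * 24 * 3600

aggregate_gb_per_s = total_bytes / seconds / 10**9
per_machine_mb_per_s = total_bytes / machines / seconds / 10**6

print(f"aggregate: {aggregate_gb_per_s:.1f} GB/s")
print(f"per machine: {per_machine_mb_per_s:.0f} MB/s")
```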
   
   - DataNode Transfer Screenshot
   
   
![image](https://github.com/user-attachments/assets/7a3dacfc-74fc-4dc8-b34a-2331431ce08d)
   
   The current time-tracking metrics may have some issues, which I will fix
in the V2 version.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

