slfan1989 commented on PR #7542: URL: https://github.com/apache/ozone/pull/7542#issuecomment-2533459041
> Is there any plan to build on these changes to add features? In general, I don't like refactoring things "just to make them nicer" unless things are very bad already. It creates churn on the code, risks introducing bugs and makes backports more difficult if any change is needed on this code due to a bug, and then needs to be backported to an earlier version.
>
> Note I haven't looked at the changes in detail. I only quickly looked to see what the scope of the change was.

@sodonnel Thank you very much for your response! I agree with your viewpoint to some extent. We do want to contribute a new feature to the community, which I have named "Fast Decommission / Maintenance." Before proceeding, however, we need to carry out some preparatory work, and I feel that this PR is one of those necessary steps.

> Background

Currently, both decommissioning and maintenance rely on the under-replication handling (UnderReplication), and replication in our system is very slow. As shown in the monitoring screenshot below, after I decommissioned one DataNode there were about 38,000 Containers to be replicated. After 18 hours, only 5,768 Containers had been replicated (the count dropped from 38,228 to 32,460), and the process was accompanied by a significant amount of I/O.

> New Solution

We have developed the V1 version of the Fast Decommission feature. In our testing, we confirmed that we can decommission at least 10 machines simultaneously, each containing 140 TB of data, and the process takes about 2 days.

The general approach of this solution is as follows:

1. The SCM creates a decommissioning execution plan based on the DataNodes being decommissioned and the global data storage status. The plan specifies which Containers should be transferred to which Target DataNode. However, the SCM is not responsible for executing this plan.
2. The SCM sends the execution plan generated in step 1 to the DataNodes that need to be decommissioned.
3. The decommissioning DataNodes are then responsible for transferring the Containers to the corresponding Target DataNodes.

- Monitoring Screenshot

This feature allows us to make better use of the system bandwidth and to reduce system I/O wait.

- DataNode Transfer Screenshot

The current time-tracking metrics may have some issues, which I will fix in the V2 version.
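To make the plan-then-execute split described above more concrete, here is a rough Python sketch. It is purely illustrative and is not the code from the V1 implementation: the names `TargetNode`, `TransferPlan`, `build_plan`, and `execute_plan` are hypothetical, and the greedy "largest Container goes to the target with the most free space" rule is an assumption chosen only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TargetNode:
    """Hypothetical view of a candidate Target DataNode."""
    name: str
    free_bytes: int


@dataclass
class TransferPlan:
    """Hypothetical plan: which Container goes to which Target DataNode."""
    source: str
    assignments: Dict[int, str] = field(default_factory=dict)


def build_plan(source: str,
               containers: Dict[int, int],
               targets: List[TargetNode]) -> TransferPlan:
    """Step 1 (SCM side): pick a target for every Container on `source`.

    `containers` maps Container ID to size in bytes. The greedy placement
    below is an assumption, not the real placement policy.
    """
    free = {t.name: t.free_bytes for t in targets}
    plan = TransferPlan(source=source)
    # Place the largest Containers first; break size ties by Container ID.
    for cid, size in sorted(containers.items(), key=lambda kv: (-kv[1], kv[0])):
        # Target with the most free space; ties go to the smallest name.
        name = min(free, key=lambda n: (-free[n], n))
        if free[name] < size:
            raise ValueError(f"no target has room for container {cid}")
        plan.assignments[cid] = name
        free[name] -= size
    return plan


def execute_plan(plan: TransferPlan, send: Callable[[int, str], None]) -> None:
    """Step 3 (DataNode side): the decommissioning node pushes each Container.

    `send` stands in for the actual Container transfer; the SCM is not
    involved in this loop.
    """
    for cid, target in plan.assignments.items():
        send(cid, target)
```

In this sketch, step 2 is simply handing the `TransferPlan` object to the decommissioning DataNode. The point of the split is that the SCM only decides placement once, while the data movement is driven by the node that already holds the Containers.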
