slfan1989 commented on PR #7249: URL: https://github.com/apache/ozone/pull/7249#issuecomment-2412636232
> @slfan1989 Thanks for working over this. The case as getting fix refers case where dnList do not have the DNs. This is possible if DN is not healthy OR overloaded, and hence not considered. Earlier thought was If replica is there means DNs are healthy, but with overload condition, this might not be true.
>
> Please check if this is same scenario where this is happening?
>
> Overall changes looks good with few minor comments.

@sumitagrawl @ashishkumar50 Thank you very much for your feedback and suggestions!

The main issue we are facing is the uncertainty of the deletion rate: sometimes it is relatively fast, but most of the time it is very slow, particularly during certain periods.

Regarding the "dnList do not have the DNs" situation, it may be caused by some DNs being unhealthy or under heavy load. In a distributed system it is common for some DNs to be unhealthy. While the replicas on those unhealthy DNs cannot be deleted, this should not affect the deletion efficiency for data on the other, healthy DNs.

Based on our monitoring metrics, here is the deletion monitoring data from October 4, 2024, 00:00 to October 7, 2024, 00:00. Overall, deletion was relatively fast during the periods 2024-10-04 00:00~04:00 and 2024-10-04 04:30~08:30; during the other periods the deletion efficiency dropped significantly, making it nearly impossible to complete the deletion tasks.

To better compare the deletion efficiency, here are two additional monitoring reports, for 2024-10-04 04:30~08:30 and 2024-10-06 04:30~08:30. During 2024-10-04 04:30~08:30 the ratio of `NumBlockDeletionTransactionSuccess` to `NumBlockDeletionTransactionCompleted` is 3:1; during 2024-10-06 04:30~08:30 this ratio is significantly higher.

> 2024-10-04 04:30:00 ~ 2024-10-04 08:30:00
>
> 2024-10-06 04:30:00 ~ 2024-10-06 08:30:00

From the monitoring data for 2024-10-06 04:30~08:30 we can infer that a significant number of containers tracked by the SCM had only some of their replicas deleted, which prevents the SCM from confirming those deletion transactions.

Our cluster consists of 1,500 machines, and the number of malfunctioning machines does not exceed 10; if more than 10 machines fail, we receive an alert and start manual intervention on those DNs.

In my view, the slow deletion arises from a combination of factors and should not be attributed solely to unhealthy or overloaded DNs. The factors I suspect are:

1. The OM continuously issues block deletion requests to the SCM, which changes the order in which the SCM's transaction iterator selects transactions.
2. The delete commands for a container's replicas are only sent to the DNs included in the dnList.
3. Because some DNs are unhealthy or overloaded, the dnList keeps changing.

Based on the current monitoring observations, once some of the DNs holding replicas of a container (A) miss the deletion command, they may struggle to receive it again quickly even after they recover, because the SCM's transaction iterator may have moved on (the OM keeps issuing new deletion requests). As a result, those DNs can take a long time to receive the deletion command, leading to a slow overall deletion process.
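To make this effect concrete, here is a toy, round-based model (hypothetical names only; this is not the real `DeletedBlockLog` or transaction-iterator code) where new OM deletion requests keep arriving, the SCM dispatches a fixed batch per round, and a transaction that could not be fully dispatched gets retried later and keeps falling behind fresh transactions:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Toy model of the starvation effect described above (hypothetical; not the
 * actual SCM deletion service). A transaction whose replica DNs were not all
 * reachable is retried later, but new OM requests keep arriving ahead of it.
 */
public class IteratorStarvationDemo {

  public static void main(String[] args) {
    Deque<String> txnQueue = new ArrayDeque<>();
    txnQueue.add("containerA");   // one of its replica DNs missed the delete command earlier

    int requeues = 0;
    for (int round = 0; round < 20; round++) {
      // 100 new deletion transactions arrive from the OM each round ...
      for (int i = 0; i < 100; i++) {
        txnQueue.add("txn-" + round + "-" + i);
      }
      // ... but only 50 transactions are dispatched per round.
      for (int i = 0; i < 50 && !txnQueue.isEmpty(); i++) {
        String txn = txnQueue.poll();
        if ("containerA".equals(txn)) {
          // Still cannot be fully dispatched: goes to the back of the backlog.
          txnQueue.add(txn);
          requeues++;
        }
      }
    }
    // containerA keeps being pushed behind fresh transactions, so even after
    // its unhealthy DN recovers it may wait many rounds before it is retried.
    System.out.println("containerA re-queued " + requeues
        + " times; backlog size = " + txnQueue.size());
  }
}
```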
The actual purpose of this PR is to make deletions orderly: if a container does not yet meet the conditions for deletion, we skip it until it does (i.e., until all of its replica DNs are healthy).
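As a rough illustration of that gating idea (a minimal sketch with hypothetical helper names, not the code in this PR), the eligibility check amounts to requiring that every replica-holding DN is currently in the healthy dnList before the container's delete transaction is dispatched:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Minimal sketch of the eligibility check; names are hypothetical and
 * simplified compared to the actual SCM block deleting service.
 */
public class DeletionGateSketch {

  /** Eligible only if every replica DN can receive the delete command now. */
  static boolean readyForDeletion(Set<String> replicaDns, Set<String> healthyDnList) {
    return healthyDnList.containsAll(replicaDns);
  }

  public static void main(String[] args) {
    Set<String> replicas = new HashSet<>(Arrays.asList("dn1", "dn2", "dn3"));

    // dn3 temporarily unhealthy: skip the container for this round.
    System.out.println(readyForDeletion(replicas,
        new HashSet<>(Arrays.asList("dn1", "dn2"))));          // false

    // All replica DNs healthy again: the container becomes eligible.
    System.out.println(readyForDeletion(replicas,
        new HashSet<>(Arrays.asList("dn1", "dn2", "dn3"))));   // true
  }
}
```

Dispatching the whole command set at once means the SCM can expect all replica acknowledgements for the transaction, instead of leaving it partially completed.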
