Re: [PR] HDDS-11498. Improve SCM deletion efficiency. [ozone]

via GitHub Mon, 14 Oct 2024 18:31:37 -0700


slfan1989 commented on PR #7249:
URL: https://github.com/apache/ozone/pull/7249#issuecomment-2412636232


   > @slfan1989 Thanks for working over this. The case as getting fix refers 
case where dnList do not have the DNs. This is possible if DN is not healthy OR 
overloaded, and hence not considered. Earlier thought was If replica is there 
means DNs are healthy, but with overload condition, this might not be true.
   > 
   > Please check if this is same scenario where this is happening?
   > 
   > Overall changes looks good with few minor comments.
   
   @sumitagrawl @ashishkumar50 
   
   Thank you very much for your feedback and suggestions!
   
   The main issue we are facing is that the deletion rate is unpredictable: 
sometimes it is relatively fast, but most of the time it is very slow, 
particularly during certain periods.
   
   Regarding the "dnList do not have the DNs" situation, it may be due to some 
DNs being unhealthy or under heavy load. In distributed systems, it is common 
for some DNs to be in an unhealthy state. While the replicas on these unhealthy 
DNs cannot be deleted, this should not affect the deletion efficiency of data 
on other healthy DNs.
   
   Based on our monitoring metrics, here is the deletion monitoring data from 
October 4, 2024, 00:00 to October 7, 2024, 00:00:
   
   Overall monitoring shows that during the periods of October 4, 2024, from 
00:00 to 04:00 and from 04:30 to 08:30, the deletion operations were relatively 
fast. However, during other time periods, the deletion efficiency significantly 
decreased, making it nearly impossible to complete deletion tasks.
   
   
![image](https://github.com/user-attachments/assets/bccd0d4f-1bc2-4a5e-818e-8213f152308e)
   
   To make the deletion efficiency easier to compare, here are two additional 
monitoring reports, for October 4, 2024, from 04:30 to 08:30 and for 
October 6, 2024, from 04:30 to 08:30.
   
   We can observe that during the period of October 4, 2024, from 04:30 to 
08:30, the ratio of `NumBlockDeletionTransactionSuccess` to 
`NumBlockDeletionTransactionCompleted` is 3:1. In contrast, during the period 
of October 6, 2024, from 04:30 to 08:30, this ratio is significantly higher.
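   
   As a minimal sketch of how such a ratio can be read from the two metrics, 
assuming both are cumulative counters sampled at the start and end of a window 
(the function name and the sample values are hypothetical, not taken from our 
dashboards):

```python
def window_ratio(success_start, success_end, completed_start, completed_end):
    """Ratio of NumBlockDeletionTransactionSuccess to
    NumBlockDeletionTransactionCompleted over one time window.

    Assumption: both metrics are monotonically increasing counters, so the
    activity inside the window is the difference between the two samples.
    """
    success_delta = success_end - success_start
    completed_delta = completed_end - completed_start
    if completed_delta == 0:
        # No transaction completed in the window: the ratio is undefined.
        return None
    return success_delta / completed_delta


# Hypothetical counter samples for one 4-hour window.
ratio = window_ratio(1000, 4000, 500, 1500)
```

   A larger value means that more per-DN successes are reported for each 
transaction that the SCM is able to mark as completed.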
   
   > 2024-10-04 04:30:00 ~ 2024-10-04 08:30:00
   
   
![image](https://github.com/user-attachments/assets/ca873cef-c83f-465a-88c6-16da3aea34cb)
   
   > 2024-10-06 04:30:00 ~ 2024-10-06 08:30:00
   
   
![image](https://github.com/user-attachments/assets/df460a08-8896-4b88-8932-bf5bd456d634)
   
   From the monitoring data for October 6, 2024, from 04:30 to 08:30, we can 
infer that a significant number of container replicas within the SCM were 
partially deleted, preventing the SCM from completing the deletion confirmation.
   
   Our cluster consists of 1,500 machines, and the number of malfunctioning 
machines does not exceed 10. If more than 10 machines fail, we receive an alert 
and begin manual intervention for these DNs. In my view, the slow deletion 
issue arises from a combination of factors, and it should not be attributed 
solely to unhealthy DNs or high load.
   
   I suspect the following contributing factors:
   
   1. The OM continuously issues deletion requests for blocks to the SCM, which 
alters the order in which the SCM selects transaction iterators.
   2. The deletion commands for a container's replicas are sent only to the 
DNs included in the DN list.
   3. Due to some unhealthy DNs or high load on DNs, the DN list is constantly 
changing.
   
   Based on the current monitoring observations, once some of the DNs holding 
replicas of a container (A) miss the deletion command, they may not receive it 
again quickly even after they recover. The OM keeps issuing new deletion 
requests, which can change the position of the SCM's transaction iterator, so 
the command for container (A) may not be resent for a long time. As a result, 
the deletion process becomes slow.
   
   The actual purpose of this PR is to make deletions orderly. If a container 
does not meet the conditions for deletion, we skip it until it satisfies them 
(i.e., when all of its DNs are healthy).
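   
   The idea can be illustrated with a minimal sketch. This is not the code in 
the PR: `DeleteTx`, `select_dispatchable`, and the data shapes are hypothetical 
names used only to show the rule "dispatch a container's deletion only when 
every DN holding a replica is in the current DN list".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeleteTx:
    """Hypothetical stand-in for one block deletion transaction."""
    tx_id: int
    container_id: int


def select_dispatchable(txs, replicas_by_container, dn_list):
    """Split transactions into (dispatch, deferred), preserving order.

    A transaction is dispatched only if all DNs holding a replica of its
    container appear in dn_list; otherwise it is deferred to a later round.
    """
    available = set(dn_list)
    dispatch, deferred = [], []
    for tx in txs:
        replicas = replicas_by_container.get(tx.container_id, set())
        # Unknown containers (no replicas recorded) are deferred as well.
        if replicas and replicas <= available:
            dispatch.append(tx)
        else:
            deferred.append(tx)
    return dispatch, deferred


# Example: dn4 is missing from the DN list, so container 200 is deferred.
txs = [DeleteTx(1, 100), DeleteTx(2, 200)]
replicas = {100: {"dn1", "dn2", "dn3"}, 200: {"dn2", "dn3", "dn4"}}
dispatch, deferred = select_dispatchable(txs, replicas, ["dn1", "dn2", "dn3"])
```

   With this rule a container is either sent to all of its replicas in the same 
round or not sent at all, so it does not end up partially deleted and waiting 
for the iterator to come back to it.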
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

