vastian180 opened a new pull request, #3069:
URL: https://github.com/apache/celeborn/pull/3069

   ### What changes were proposed in this pull request?
   As the title says, this PR improves the calculation logic of the 
`PausePushDataTime`, `PausePushDataAndReplicateTime`, and 
`pausePushDataCounter` metrics.
   
   ### Why are the changes needed?
   During a stress test, the `pausePushDataAndReplicateTime` metric reported a 
value of 55.1 years, which is obviously abnormal, as shown in the figure 
below.
   
![image](https://github.com/user-attachments/assets/e20f41d2-888d-4250-8fb6-c14e26efcf40)
 
   
   The reason is as follows:
   When `ServingState` transitions `NONE PAUSED` -> `PAUSE PUSH` -> 
`PAUSE PUSH AND REPLICATE`, `pausePushDataAndReplicateStartTime` is not 
correctly assigned and still holds `-1L`.
   When `trimCounter >= forceAppendPauseSpentTimeThreshold`, or when 
`ServingState` changes from `PAUSE PUSH AND REPLICATE` to `NONE PAUSED`, the 
`appendPauseSpentTime` method is executed to update 
`pausePushDataAndReplicateTime`.
   The update then evaluates to `pausePushDataAndReplicateTime += 
System.currentTimeMillis() - (-1L)`, so the entire time since the epoch is 
added to the metric. This is displayed as 55.1 years 
(`System.currentTimeMillis()/1000/3600/24/365`).
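
   The size of the error follows directly from the `-1L` value. The following 
is a minimal sketch of that arithmetic, not Celeborn's code: the class 
`PauseTimeSentinelDemo`, its methods, and the `UNSET` constant are hypothetical 
names introduced for illustration, and the guarded variant is only one possible 
way to avoid the problem.

```java
// Hypothetical illustration; not taken from Celeborn's MemoryManager.
final class PauseTimeSentinelDemo {
    // Assumed sentinel meaning "no start time was recorded" (the -1L above).
    static final long UNSET = -1L;

    // Unguarded accumulation: with startTime == UNSET this yields now + 1.
    static long elapsedUnguarded(long nowMillis, long startTime) {
        return nowMillis - startTime;
    }

    // Guarded accumulation: contributes nothing unless a start time was recorded.
    static long elapsedGuarded(long nowMillis, long startTime) {
        return startTime == UNSET ? 0L : nowMillis - startTime;
    }

    // Same conversion as in the description: ms -> s -> h -> d -> years.
    static double toYears(long millis) {
        return millis / 1000.0 / 3600 / 24 / 365;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.printf("unguarded: %.1f years%n",
                toYears(elapsedUnguarded(now, UNSET)));
        System.out.printf("guarded:   %.1f years%n",
                toYears(elapsedGuarded(now, UNSET)));
    }
}
```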
   
   Similarly, when `ServingState` transitions `NONE PAUSED` -> 
`PAUSE PUSH AND REPLICATE` -> `PAUSE PUSH`, `pausePushDataStartTime` is not 
correctly assigned. 
   When `trimCounter >= forceAppendPauseSpentTimeThreshold`, or when 
`ServingState` changes from `PAUSE PUSH` to `NONE PAUSED`, the 
`appendPauseSpentTime` method is executed to update `pausePushDataTime`, which 
is then also displayed as 55.1 years.
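
   One way to keep both start times valid across these transitions is to close 
the interval of the state being left and open a new one for the state being 
entered. The sketch below illustrates that idea only and is not the actual 
patch: `PauseTimerSketch`, its `transition` method, and the `UNSET` constant 
are hypothetical, and the enum constants are the state names above with 
underscores so that they are valid Java identifiers.

```java
// Hypothetical sketch; field names mirror the description, not Celeborn's source.
final class PauseTimerSketch {
    enum State { NONE_PAUSED, PAUSE_PUSH, PAUSE_PUSH_AND_REPLICATE }

    static final long UNSET = -1L;

    State state = State.NONE_PAUSED;
    long pausePushDataStartTime = UNSET;
    long pausePushDataAndReplicateStartTime = UNSET;
    long pausePushDataTime = 0L;
    long pausePushDataAndReplicateTime = 0L;

    void transition(State next, long nowMillis) {
        if (next == state) {
            return;
        }
        // Close the interval of the state being left, if one was opened.
        if (state == State.PAUSE_PUSH && pausePushDataStartTime != UNSET) {
            pausePushDataTime += nowMillis - pausePushDataStartTime;
            pausePushDataStartTime = UNSET;
        } else if (state == State.PAUSE_PUSH_AND_REPLICATE
                && pausePushDataAndReplicateStartTime != UNSET) {
            pausePushDataAndReplicateTime +=
                    nowMillis - pausePushDataAndReplicateStartTime;
            pausePushDataAndReplicateStartTime = UNSET;
        }
        // Open an interval for the state being entered, whatever the previous state was.
        if (next == State.PAUSE_PUSH) {
            pausePushDataStartTime = nowMillis;
        } else if (next == State.PAUSE_PUSH_AND_REPLICATE) {
            pausePushDataAndReplicateStartTime = nowMillis;
        }
        state = next;
    }
}
```

   Because the start time is assigned on every entry into a paused state, both 
orderings described above (`PAUSE PUSH` first or `PAUSE PUSH AND REPLICATE` 
first) accumulate only the time actually spent in each state.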
   
   This PR also modifies the logic of `pausePushDataCounter`:
   In the `PAUSE PUSH AND REPLICATE` state, the worker also stops receiving 
pushData. 
   Therefore: 
   When `NONE PAUSED` -> `PAUSE PUSH AND REPLICATE`: `pausePushDataCounter` 
needs to be increased.
   When `PAUSE PUSH AND REPLICATE` -> `PAUSE PUSH`: `pausePushDataCounter` does 
not need to be increased.
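
   The counter rule can be read as "count a pause only when pushData was being 
received before the transition". A minimal sketch of that reading follows, with 
the hypothetical helper `shouldCountPushPause` and the state names written with 
underscores. The `PAUSE PUSH` -> `PAUSE PUSH AND REPLICATE` case is not stated 
above; treating it as "no increase" is an assumption made by the same 
reasoning.

```java
// Hypothetical sketch of the counter rule; not Celeborn's implementation.
final class PauseCounterSketch {
    enum State { NONE_PAUSED, PAUSE_PUSH, PAUSE_PUSH_AND_REPLICATE }

    // pushData is received only in NONE_PAUSED, so a new push pause begins
    // only when leaving NONE_PAUSED for either paused state.
    static boolean shouldCountPushPause(State from, State to) {
        return from == State.NONE_PAUSED && to != State.NONE_PAUSED;
    }
}
```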
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Celeborn Dashboard
   
![image](https://github.com/user-attachments/assets/0c82ef26-b66b-420b-9c33-e4dd19d5f396)
 
   MemoryManagerSuite#[CELEBORN-882] Test MemoryManager check memory thread 
logic


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
