[jira] [Comment Edited] (FLINK-36531) AutoScaler needs to consider the lag from last checkpoint

Sai Sharath Dandi (Jira) Mon, 18 Nov 2024 13:06:18 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-36531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899228#comment-17899228
 ]


Sai Sharath Dandi edited comment on FLINK-36531 at 11/18/24 7:35 PM:
---------------------------------------------------------------------

My understanding is that FLIP-461 accumulates the scaling actions from the 
autoscaler on the scheduler side, so it will not use an outdated action as long 
as the autoscaler is running continuously. I think the improvement to rescale 
early, rather than delay the scaling until the next checkpoint (in the case of 
infrequent checkpoints), could be better handled on the scheduler side than in 
the autoscaler. Overall though, I feel there is a lack of synergy and clear 
separation of concerns across the autoscaler and the scheduler. The autoscaler 
isn't "aware" of what's going on on the scheduler side, which causes the 
scaling actions to be delayed.


was (Author: JIRAUSER298466):
My understanding is that FLIP-461 accumulates the scaling actions from the 
autoscaler on the scheduler side, so it will not use an outdated action as long 
as the autoscaler is running continuously. I think the improvement to rescale 
early, rather than delay the scaling until the next checkpoint (in the case of 
infrequent checkpoints), could be better handled on the scheduler side than in 
the autoscaler. Overall though, I feel there is a lack of clear separation of 
concerns across the autoscaler and the scheduler. The autoscaler isn't "aware" 
of what's going on on the scheduler side, which causes the scaling actions to 
be delayed.

> AutoScaler needs to consider the lag from last checkpoint
> ---------------------------------------------------------
>
>                 Key: FLINK-36531
>                 URL: https://issues.apache.org/jira/browse/FLINK-36531
>             Project: Flink
>          Issue Type: Improvement
>          Components: Autoscaler
>            Reporter: Sai Sharath Dandi
>            Priority: Major
>
> Autoscaler computes the target processing capacity as 
> [below|https://sg.uberinternal.com/code.uber.internal/uber-code/[email protected]/-/blob/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/utils/AutoScalerUtils.java?L47]
> // Target = LAG/CATCH_UP + INPUT_RATE*RESTART/CATCH_UP + INPUT_RATE/TARGET_UTIL
>  
> During the scaling action, the autoscaler restarts the job from the last 
> successful checkpoint. We therefore need to include the number of records 
> processed since the last successful checkpoint as part of the lag, because 
> those records will be replayed after scaling. This is particularly important 
> for jobs with long checkpoint intervals and large state, as there can be a 
> significant difference between the realtime lag and the lag measured from the 
> checkpoint.
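
To make the effect concrete, here is a minimal Python sketch of the formula 
quoted in the description, extended with the proposed replay term. It is not 
the autoscaler's actual code: the function name, the parameter names, and the 
`replayed_since_checkpoint` argument are all hypothetical, and the numbers in 
the usage note are made up for illustration.

```python
def target_capacity(lag, input_rate, catch_up_s, restart_s, target_util,
                    replayed_since_checkpoint=0.0):
    """Target = LAG/CATCH_UP + INPUT_RATE*RESTART/CATCH_UP + INPUT_RATE/TARGET_UTIL.

    replayed_since_checkpoint is a hypothetical extra input: the number of
    records processed since the last successful checkpoint, which will be
    replayed after the restart and is therefore added to the lag.
    """
    if catch_up_s <= 0 or target_util <= 0:
        raise ValueError("catch_up_s and target_util must be positive")

    # Lag as seen from the checkpoint the job will restart from.
    effective_lag = lag + replayed_since_checkpoint

    return (effective_lag / catch_up_s            # drain the backlog
            + input_rate * restart_s / catch_up_s  # backlog built up during restart
            + input_rate / target_util)            # steady-state load at target utilization


# Illustrative numbers only: 1000 rec/s input, 10 min catch-up, 2 min restart,
# 80% target utilization, 60k records of realtime lag, 5 min since last checkpoint.
without_replay = target_capacity(60_000, 1000, 600, 120, 0.8)
with_replay = target_capacity(60_000, 1000, 600, 120, 0.8,
                              replayed_since_checkpoint=300_000)
```

With these made-up inputs the replay term raises the target from roughly 1550 
to roughly 2050 records per second, which illustrates why the gap matters more 
as the checkpoint interval grows.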



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
