[jira] [Commented] (FLINK-30680) Consider using the autoscaler to detect slow taskmanagers

Matt Wang (Jira) Thu, 19 Jan 2023 19:42:04 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-30680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678973#comment-17678973
 ]


Matt Wang commented on FLINK-30680:
-----------------------------------

[~gyfora] When we designed this internally, we also found that the mechanism 
relies heavily on metrics. It also needs to observe how metrics trend over time 
(to judge whether a job's latency is rising or declining), so we implemented it 
in a control service rather than inside the Flink engine. I therefore think 
implementing this in the operator is a good choice, since it can reuse the 
metrics module of the autoscaler.
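
As a rough illustration of judging whether a job's latency is on the rise or 
decline, the sketch below fits a least-squares line to a window of latency 
samples and classifies the slope. It is not the internal implementation or 
the autoscaler's code; the function name `latency_trend`, the sample format 
and the `tolerance` parameter are all assumptions made for this example.

```python
from statistics import mean


def latency_trend(samples, tolerance=0.0):
    """Classify a window of (timestamp_seconds, latency_ms) samples.

    Hypothetical helper: returns "rising", "declining" or "flat" depending
    on the least-squares slope of latency over time. `tolerance` is the
    slope magnitude (ms per second) below which the trend counts as flat.
    """
    if len(samples) < 2:
        return "flat"  # a single point carries no trend information

    ts = [t for t, _ in samples]
    ys = [y for _, y in samples]
    t_mean, y_mean = mean(ts), mean(ys)

    # Ordinary least-squares slope: cov(t, y) / var(t).
    denom = sum((t - t_mean) ** 2 for t in ts)
    if denom == 0:
        return "flat"  # all samples share one timestamp
    slope = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) / denom

    if slope > tolerance:
        return "rising"
    if slope < -tolerance:
        return "declining"
    return "flat"
```

A control loop could call this on each metric window and only react once the 
trend has stayed "rising" for several consecutive windows.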

> Consider using the autoscaler to detect slow taskmanagers
> ---------------------------------------------------------
>
>                 Key: FLINK-30680
>                 URL: https://issues.apache.org/jira/browse/FLINK-30680
>             Project: Flink
>          Issue Type: New Feature
>          Components: Autoscaler, Kubernetes Operator
>            Reporter: Gyula Fora
>            Priority: Major
>
> We could leverage logic in the autoscaler to detect slow taskmanagers by 
> comparing the per-record processing times between them.
> If we notice that all subtasks on a single TM are considerably slower than 
> the rest (at similar input rates) we should try simply restarting the job 
> instead of scaling it up.
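
The check described in the issue could be sketched roughly as follows. This 
is only an illustration under stated assumptions, not the autoscaler's 
logic: the function name `find_slow_taskmanagers`, the input record layout, 
and the `slowdown_factor` and `rate_tolerance` thresholds are hypothetical, 
and all subtasks are assumed to belong to the same job vertex so that their 
per-record processing times are comparable.

```python
from collections import defaultdict
from statistics import median


def find_slow_taskmanagers(subtasks, slowdown_factor=2.0, rate_tolerance=0.2):
    """Return the sorted ids of TMs whose subtasks are all considerably slow.

    `subtasks` is a list of dicts with the (assumed) keys:
      "tm"                 - TaskManager id
      "processing_time_ms" - per-record processing time of the subtask
      "input_rate"         - records per second received by the subtask
    """
    if not subtasks:
        return []

    # Only compare subtasks that see similar input rates: keep those within
    # rate_tolerance (relative) of the median input rate.
    median_rate = median(s["input_rate"] for s in subtasks)
    comparable = [
        s for s in subtasks
        if abs(s["input_rate"] - median_rate) <= rate_tolerance * median_rate
    ]

    # Group per-record processing times by TaskManager.
    times_by_tm = defaultdict(list)
    for s in comparable:
        times_by_tm[s["tm"]].append(s["processing_time_ms"])

    slow = []
    for tm, times in times_by_tm.items():
        others = [t for other, ts in times_by_tm.items() if other != tm for t in ts]
        if not others:
            continue  # nothing to compare against
        baseline = median(others)
        # "All subtasks slower": even the fastest one exceeds the threshold.
        if min(times) > slowdown_factor * baseline:
            slow.append(tm)
    return sorted(slow)
```

If the returned list is non-empty, the proposal in the issue would be to 
restart the job rather than scale it up.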



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
