[ 
https://issues.apache.org/jira/browse/FLINK-30680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678683#comment-17678683
 ] 

Gyula Fora commented on FLINK-30680:
------------------------------------

Thanks for the input [~wangm92]. I think this would be a good improvement in 
the operator itself. We could combine this with the existing health check 
mechanisms that detect jobs stuck in a failure loop and simply restart them in 
those cases.

Alternatively, we could put this into the new autoscaler module, which already 
collects metrics from the Flink jobs for scaling purposes.

> Consider using the autoscaler to detect slow taskmanagers
> ---------------------------------------------------------
>
>                 Key: FLINK-30680
>                 URL: https://issues.apache.org/jira/browse/FLINK-30680
>             Project: Flink
>          Issue Type: New Feature
>          Components: Autoscaler, Kubernetes Operator
>            Reporter: Gyula Fora
>            Priority: Major
>
> We could leverage logic in the autoscaler to detect slow taskmanagers by 
> comparing the per-record processing times between them.
> If we notice that all subtasks on a single TM are considerably slower than 
> the rest (at similar input rates) we should try simply restarting the job 
> instead of scaling it up.
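
The check described in the issue could look roughly like the sketch below. It
is an illustration only, not the operator's or autoscaler's actual code: the
names SubtaskSample and find_slow_taskmanagers, the slowdown_factor threshold,
and the use of a median baseline are all assumptions. It also assumes the
samples were already restricted to subtasks with similar input rates.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class SubtaskSample:
    """Hypothetical metric sample for one subtask (not a Flink type)."""
    taskmanager: str      # id of the TM hosting the subtask
    per_record_ms: float  # average processing time per record


def find_slow_taskmanagers(samples, slowdown_factor=2.0):
    """Return TMs on which every subtask is considerably slower than the rest.

    A TM is flagged when even its fastest subtask is more than
    `slowdown_factor` times slower than the median of all subtasks
    running on the other TMs. The threshold is an assumed default.
    """
    by_tm = {}
    for s in samples:
        by_tm.setdefault(s.taskmanager, []).append(s.per_record_ms)

    slow = []
    for tm, times in by_tm.items():
        rest = [s.per_record_ms for s in samples if s.taskmanager != tm]
        if not rest:
            # Nothing to compare against (single-TM job).
            continue
        baseline = median(rest)
        if min(times) > slowdown_factor * baseline:
            slow.append(tm)
    return sorted(slow)


samples = [
    SubtaskSample("tm-1", 1.0), SubtaskSample("tm-1", 1.2),
    SubtaskSample("tm-2", 1.1), SubtaskSample("tm-2", 0.9),
    SubtaskSample("tm-3", 5.0), SubtaskSample("tm-3", 6.0),
]
# A non-empty result would suggest restarting the job rather than scaling up.
slow_tms = find_slow_taskmanagers(samples)
```

Requiring the fastest subtask on a TM to exceed the threshold mirrors the
"all subtasks on a single TM" condition, so a single slow operator does not
cause the whole TM to be flagged.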



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
