[ https://issues.apache.org/jira/browse/FLINK-30680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678683#comment-17678683 ]
Gyula Fora commented on FLINK-30680:
------------------------------------
Thanks for the input, [~wangm92]. I think this would be a good improvement in
the operator itself. We could combine this with the existing health check
mechanisms that detect jobs stuck in a failure loop and simply restart them in
that case.

Alternatively, we could put this into the new autoscaler module, which already
collects metrics from the Flink jobs for scaling purposes.
> Consider using the autoscaler to detect slow taskmanagers
> ---------------------------------------------------------
>
> Key: FLINK-30680
> URL: https://issues.apache.org/jira/browse/FLINK-30680
> Project: Flink
> Issue Type: New Feature
> Components: Autoscaler, Kubernetes Operator
> Reporter: Gyula Fora
> Priority: Major
>
> We could leverage logic in the autoscaler to detect slow taskmanagers by
> comparing the per-record processing times between them.
> If we notice that all subtasks on a single TM are considerably slower than
> the rest (at similar input rates), we should try simply restarting the job
> instead of scaling it up.
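
The detection idea in the issue description could be sketched roughly as
follows. This is only an illustration under stated assumptions, not the
operator's or the autoscaler's actual logic: SubtaskSample,
find_slow_taskmanagers, slowdown_factor and rate_tolerance are hypothetical
names and thresholds, and the sketch additionally assumes that subtasks are
only compared with other subtasks of the same job vertex.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class SubtaskSample:
    """One metric sample for a subtask (hypothetical structure)."""
    tm_id: str            # TaskManager hosting the subtask
    vertex_id: str        # job vertex the subtask belongs to
    per_record_ms: float  # average processing time per record
    input_rate: float     # records per second


def find_slow_taskmanagers(samples, slowdown_factor=2.0, rate_tolerance=0.25):
    """Return the sorted ids of TMs on which every comparable subtask is
    considerably slower than its peers on other TMs.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    by_vertex = defaultdict(list)
    for s in samples:
        by_vertex[s.vertex_id].append(s)

    verdicts = defaultdict(list)  # tm_id -> one bool per compared subtask
    for vertex_samples in by_vertex.values():
        for s in vertex_samples:
            # Peers: same vertex, different TM, similar input rate.
            peers = [
                p for p in vertex_samples
                if p.tm_id != s.tm_id
                and abs(p.input_rate - s.input_rate)
                <= rate_tolerance * max(p.input_rate, s.input_rate)
            ]
            if not peers:
                continue  # nothing comparable, so no verdict
            baseline = median(p.per_record_ms for p in peers)
            verdicts[s.tm_id].append(s.per_record_ms > slowdown_factor * baseline)

    # A TM is flagged only if all of its compared subtasks are slow.
    return sorted(tm for tm, v in verdicts.items() if v and all(v))


samples = [
    SubtaskSample("tm-1", "map", 1.0, 1000),
    SubtaskSample("tm-2", "map", 1.1, 1020),
    SubtaskSample("tm-3", "map", 3.5, 980),
    SubtaskSample("tm-1", "sink", 0.5, 1000),
    SubtaskSample("tm-2", "sink", 0.6, 990),
    SubtaskSample("tm-3", "sink", 1.8, 1010),
]
print(find_slow_taskmanagers(samples))  # ['tm-3']
```

A non-empty result would be the signal to try restarting the job rather than
scaling it up. Using the median of the peers as the baseline keeps a single
slow TM from distorting the comparison for the healthy ones.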
--
This message was sent by Atlassian Jira
(v8.20.10#820010)