[
https://issues.apache.org/jira/browse/FLINK-30680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678650#comment-17678650
]
Matt Wang commented on FLINK-30680:
-----------------------------------
In our production environment, during the peak business period, some flink
streaming jobs maybe cause processing delays due to the slow task-manager: # In
some cases, we found that processing slow tasks(busytime 100% or tasks in pool
usage 100%) were clustered on the same nodes;
# These nodes usually have some problems, like: the CPU usage is too high,
some nodes are older and less powerful, the network delay of the node is
relatively high, and so on.
These cases appear more frequently. And k8s can solve some problems, but cannot
cover all cases. There should also be a speculative execution mechanism like
the batch job for streaming job to process this case automatically.
We launched the *Slow-TaskManager Dection mechanism* within the company to
automatically process this case. If there is a lag record in the job, the
mechanism will determine whether it is caused by the slow-Taskmanager.It will
require the job not to have data skew, and the slow-processing tasks are
clustered on the same taskmanager or node. If not clustered on the same
taskmanager/node, it is not judged as the slow-taskmanager.
After the slow taskmanager is found, Flink will release the taskmanager, and it
will be added to the blacklist (internal implementation, the machines that join
the blacklist will not schedule new pod), then the job will do failover
recovery. This mechanism works very effectively in our internal environment.
We are very happy to address this jira and contribute our internal mechanism to
the community, and discuss it together to improve the automatic operation and
maintenance capability of Flink.
> Consider using the autoscaler to detect slow taskmanagers
> ---------------------------------------------------------
>
> Key: FLINK-30680
> URL: https://issues.apache.org/jira/browse/FLINK-30680
> Project: Flink
> Issue Type: New Feature
> Components: Autoscaler, Kubernetes Operator
> Reporter: Gyula Fora
> Priority: Major
>
> We could leverage logic in the autoscaler to detect slow taskmanagers by
> comparing the per-record processing times between them.
> If we notice that all subtasks on a single TM are considerably slower than
> the rest (at similar input rates) we should try simply restarting the job
> instead of scaling it up.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)