[ 
https://issues.apache.org/jira/browse/FLINK-30680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678650#comment-17678650
 ] 

Matt Wang commented on FLINK-30680:
-----------------------------------

In our production environment, during peak business periods, some Flink 
streaming jobs may suffer processing delays caused by a slow TaskManager:
 # In some cases, we found that the slow-processing tasks (busy time at 100% 
or in-pool usage at 100%) were clustered on the same nodes;
 # These nodes usually have some problem, for example: the CPU usage is too 
high, the node is older and less powerful, the node's network latency is 
relatively high, and so on.

These cases occur quite frequently. Kubernetes can solve some of these 
problems, but it cannot cover all cases. Streaming jobs should also have a 
mechanism, similar to speculative execution for batch jobs, that handles this 
case automatically.


We launched a *Slow-TaskManager Detection mechanism* within the company to 
handle this case automatically. If a job shows record lag, the mechanism 
determines whether the lag is caused by a slow TaskManager. Two conditions 
must hold: the job has no data skew, and the slow-processing tasks are 
clustered on the same TaskManager or node. If the slow tasks are not clustered 
on the same TaskManager/node, the case is not judged to be a slow TaskManager.
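
As a rough illustration of the check described above (a sketch only, not the 
internal implementation): the class SubtaskStats, the function 
find_slow_task_manager, and both thresholds are hypothetical names and values 
chosen for this example. Data skew is approximated here by comparing 
per-subtask input rates, which is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class SubtaskStats:
    task_manager: str          # TaskManager (or node) hosting the subtask
    busy_ratio: float          # fraction of time the subtask is busy, 0.0 to 1.0
    records_in_per_sec: float  # input rate, used as a crude data-skew signal


def find_slow_task_manager(stats, busy_threshold=0.95, skew_tolerance=0.2):
    """Return the one TaskManager hosting every slow subtask, or None."""
    all_hosts = {s.task_manager for s in stats}
    if len(all_hosts) < 2:
        # With a single TaskManager there is nothing to compare against.
        return None

    # Skew check: widely different input rates suggest data skew, not a slow host.
    rates = [s.records_in_per_sec for s in stats]
    mean_rate = sum(rates) / len(rates)
    if mean_rate > 0 and (max(rates) - min(rates)) / mean_rate > skew_tolerance:
        return None

    # Clustering check: every slow subtask must sit on the same TaskManager.
    slow_hosts = {s.task_manager for s in stats if s.busy_ratio >= busy_threshold}
    if len(slow_hosts) != 1:
        return None
    return next(iter(slow_hosts))
```

With two saturated subtasks on "tm-1" and idle subtasks elsewhere at similar 
input rates, the function returns "tm-1"; if the saturated subtasks are spread 
over several TaskManagers, or the input rates differ strongly, it returns None.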
 
After a slow TaskManager is found, Flink releases it and adds it to a 
blacklist (an internal implementation: no new pods are scheduled on machines 
in the blacklist), and the job then goes through failover recovery. This 
mechanism has worked very effectively in our internal environment.
 
We would be very happy to work on this Jira issue and contribute our internal 
mechanism to the community, and to discuss it together in order to improve 
Flink's automated operations and maintenance capabilities.

> Consider using the autoscaler to detect slow taskmanagers
> ---------------------------------------------------------
>
>                 Key: FLINK-30680
>                 URL: https://issues.apache.org/jira/browse/FLINK-30680
>             Project: Flink
>          Issue Type: New Feature
>          Components: Autoscaler, Kubernetes Operator
>            Reporter: Gyula Fora
>            Priority: Major
>
> We could leverage logic in the autoscaler to detect slow taskmanagers by 
> comparing the per-record processing times between them.
> If we notice that all subtasks on a single TM are considerably slower than 
> the rest (at similar input rates) we should try simply restarting the job 
> instead of scaling it up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
