[
https://issues.apache.org/jira/browse/FLINK-30184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17639775#comment-17639775
]
Xintong Song commented on FLINK-30184:
--------------------------------------
[~fanrui], sorry for the late response.
I agree with [~wangyang0918] that this is probably more suitable for an
external service that manages / monitors Flink.
Thread dumps are for debugging and should not be taken continuously given the
performance impact. Flink already offers REST API endpoints for capturing the
thread stacks of the
[jobmanager|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobmanager-thread-dump]
and
[taskmanager|https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#taskmanagers-taskmanagerid-thread-dump].
It should be easy for an external monitoring system to capture the dumps when
the job is detected to be slow.
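As a rough sketch of what such an external monitor could do, the Python snippet below builds the URLs of the two REST endpoints linked above and flattens a response into text. The base URL, the function names, and the response field names (`threadInfos`, `stringifiedThreadInfo`) are assumptions for illustration and should be checked against the REST API docs for the Flink version in use. How the monitor decides that a job is slow is out of scope here.

```python
import json
import urllib.request
from urllib.parse import quote


def jobmanager_dump_url(base_url: str) -> str:
    """URL of the JobManager thread-dump endpoint (path taken from the linked docs)."""
    return f"{base_url.rstrip('/')}/jobmanager/thread-dump"


def taskmanager_dump_url(base_url: str, taskmanager_id: str) -> str:
    """URL of the thread-dump endpoint for one TaskManager."""
    tm = quote(taskmanager_id, safe="")  # escape characters that are unsafe in a path segment
    return f"{base_url.rstrip('/')}/taskmanagers/{tm}/thread-dump"


def format_dump(payload: dict) -> str:
    """Join the per-thread stack strings of a parsed response into one text blob.

    Assumes a body shaped like {"threadInfos": [{"stringifiedThreadInfo": ...}]}.
    """
    infos = payload.get("threadInfos", [])
    return "\n".join(info.get("stringifiedThreadInfo", "").rstrip() for info in infos)


def capture_dump(url: str, timeout: float = 10.0) -> str:
    """Fetch one thread dump; meant to be called only once slowness is detected."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return format_dump(json.load(resp))
```

A monitor could call, for example, `capture_dump(jobmanager_dump_url("http://localhost:8081"))` only when its own slowness check fires and store the result wherever it keeps diagnostics, so no dumps are taken while the job is healthy.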
> Save TM/JM thread stack periodically
> ------------------------------------
>
> Key: FLINK-30184
> URL: https://issues.apache.org/jira/browse/FLINK-30184
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Web Frontend
> Reporter: Rui Fan
> Priority: Major
> Fix For: 1.17.0
>
>
> After FLINK-14816, FLINK-25398, and FLINK-25372, Flink users can view the
> thread stacks of the TM/JM in the Flink Web UI.
> This helps Flink users find out why a Flink job is stuck or why
> processing is slow. It is very useful for troubleshooting.
> However, sometimes Flink tasks get stuck or process slowly, but by the time
> the user troubleshoots the problem, the job has recovered. It is then
> difficult to find out what happened to the Flink job and why it was slow.
>
> So, could we periodically save the thread stack of TM or JM in the TM log
> directory?
> Define some configurations:
> cluster.thread-dump.interval=1min
> cluster.thread-dump.cleanup-time=48 hours