[
https://issues.apache.org/jira/browse/FLINK-39984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yun Tang reassigned FLINK-39984:
--------------------------------
Assignee: xingsuo-zbz
> Thread dump RPC can cause heartbeat timeout and unnecessary JM/TM failover
> --------------------------------------------------------------------------
>
> Key: FLINK-39984
> URL: https://issues.apache.org/jira/browse/FLINK-39984
> Project: Flink
> Issue Type: Bug
> Components: Runtime / REST, Runtime / Web Frontend
> Affects Versions: 1.17.2, 2.3.0, 1.20.5
> Reporter: xingsuo-zbz
> Assignee: xingsuo-zbz
> Priority: Critical
>
> h2. Summary
> Clicking "Thread Dump" on a JobManager or TaskManager in the Flink Web UI can
> cause the targeted process to miss heartbeats and be killed as failed, taking
> down the running job. The diagnostic feature itself triggers the failure.
> Observed in production with errors such as:
> {quote}Heartbeat of TaskManager with id <tm-id> timed out.
> java.util.concurrent.TimeoutException: Heartbeat of TaskManager ... timed out.
> {quote}
> h2. Root cause
> Two _independent_ issues compound:
> *1. The RPC handler runs synchronously on the main actor thread.*
> {{TaskExecutor#requestThreadDump}}
> (flink-runtime/.../taskexecutor/TaskExecutor.java:1463)
> and {{Dispatcher#requestThreadDump}}
> (flink-runtime/.../dispatcher/Dispatcher.java:1858)
> both return:
> {code:java}
> return CompletableFuture.completedFuture(ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth));
> {code}
> While the dump is being constructed, the actor mailbox does not advance, so
> heartbeat replies, task lifecycle messages, and checkpoint coordination
> messages all queue up behind it. Other heavy handlers in {{TaskExecutor}}
> (e.g. {{requestLogList}}, {{requestFileUploadByFilePath}}, {{updatePartitions}})
> are already offloaded to {{ioExecutor}} or the scheduled executor — this one
> was not.
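> A minimal sketch of the stall described above, assuming a single-threaded
> executor as a stand-in for the RPC main thread. {{MailboxStallSketch}},
> {{slowDump}} and both executors are hypothetical names for illustration, not
> Flink code. The synchronous variant mirrors the
> {{completedFuture(dumpAndCreate(...))}} pattern; the offloaded variant hands
> the work to a second executor so the next message can be handled right away.
> {code:java}
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.CountDownLatch;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
>
> /** Hypothetical stand-in for an RPC endpoint; not Flink code. */
> class MailboxStallSketch {
>
>     /** Stand-in for an expensive dump: blocks until the latch is released. */
>     static String slowDump(CountDownLatch mayFinish) {
>         try {
>             mayFinish.await();
>         } catch (InterruptedException e) {
>             Thread.currentThread().interrupt();
>         }
>         return "dump";
>     }
>
>     /** Returns true if a heartbeat is handled while the dump is still running. */
>     static boolean heartbeatRunsDuringDump(boolean offload) throws InterruptedException {
>         ExecutorService mainThread = Executors.newSingleThreadExecutor(); // the "mailbox"
>         ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
>         CountDownLatch dumpMayFinish = new CountDownLatch(1);
>         CountDownLatch heartbeatHandled = new CountDownLatch(1);
>         try {
>             // Message 1: the thread dump request.
>             mainThread.execute(() -> {
>                 if (offload) {
>                     // Returns immediately; the work happens on ioExecutor.
>                     CompletableFuture.supplyAsync(() -> slowDump(dumpMayFinish), ioExecutor);
>                 } else {
>                     // The argument is evaluated on the main thread before the future exists.
>                     CompletableFuture.completedFuture(slowDump(dumpMayFinish));
>                 }
>             });
>             // Message 2: a heartbeat queued behind the dump request.
>             mainThread.execute(heartbeatHandled::countDown);
>             // The dump cannot finish until the latch is released in the finally block.
>             return heartbeatHandled.await(300, TimeUnit.MILLISECONDS);
>         } finally {
>             dumpMayFinish.countDown();
>             mainThread.shutdown();
>             ioExecutor.shutdown();
>         }
>     }
>
>     public static void main(String[] args) throws InterruptedException {
>         System.out.println("synchronous handler: " + heartbeatRunsDuringDump(false));
>         System.out.println("offloaded handler:   " + heartbeatRunsDuringDump(true));
>     }
> }
> {code}
> Offloading alone only unblocks the mailbox; it does not shorten the dump.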
> *2. {{dumpAllThreads(true, true)}} triggers a long JVM-wide safepoint.*
> {{JvmUtils#createThreadDump}} (flink-runtime/.../util/JvmUtils.java:50) calls:
> {code:java}
> threadMxBean.dumpAllThreads(true, true); // lockedMonitors + lockedSynchronizers
> {code}
> Collecting locked monitors and AQS synchronizers requires walking every
> thread's lock state inside a single safepoint. On busy JVMs (Netty + RocksDB
> + async I/O + user threads — easily 10k+ threads in production), this can
> take many seconds to tens of seconds. During the safepoint, _every_ thread
> in the JVM is paused, including the heartbeat dispatcher itself.
> If the safepoint duration plus mailbox queueing exceeds {{heartbeat.timeout}}
> (default 50s), the JM marks the TM dead and triggers a failover — even
> though the TM is functional.
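> To get a rough feel for the cost difference on a given JVM, the two flag
> combinations can be timed directly against the standard
> {{java.lang.management.ThreadMXBean}}. This is only a sketch:
> {{DumpCostSketch}} and {{timeDump}} are hypothetical names, the 200 parked
> daemon threads are an arbitrary assumption, and idle threads that hold no
> locks will not reproduce the production-scale pauses described above.
> {code:java}
> import java.lang.management.ManagementFactory;
> import java.lang.management.ThreadMXBean;
> import java.util.concurrent.locks.LockSupport;
>
> /** Illustrative timing harness; not part of Flink. */
> class DumpCostSketch {
>
>     /** Times one dumpAllThreads call; returns {elapsedNanos, threadCount}. */
>     static long[] timeDump(boolean withLockInfo) {
>         ThreadMXBean bean = ManagementFactory.getThreadMXBean();
>         // Only request lock details that this JVM can actually report.
>         boolean monitors = withLockInfo && bean.isObjectMonitorUsageSupported();
>         boolean synchronizers = withLockInfo && bean.isSynchronizerUsageSupported();
>         long start = System.nanoTime();
>         int threads = bean.dumpAllThreads(monitors, synchronizers).length;
>         return new long[] {System.nanoTime() - start, threads};
>     }
>
>     public static void main(String[] args) {
>         // Park some daemon threads so the dump has more to walk.
>         for (int i = 0; i < 200; i++) {
>             Thread t = new Thread(() -> {
>                 while (true) {
>                     LockSupport.park();
>                 }
>             });
>             t.setDaemon(true);
>             t.start();
>         }
>         long[] full = timeDump(true);
>         long[] safe = timeDump(false);
>         System.out.printf("full: %d us over %d threads%n", full[0] / 1_000, full[1]);
>         System.out.printf("safe: %d us over %d threads%n", safe[0] / 1_000, safe[1]);
>     }
> }
> {code}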
> Note: fixing only (1) helps short dumps but not long ones, because the
> safepoint pauses the heartbeat thread regardless of which executor the
> caller runs on. Fixing only (2) helps long dumps but still allows the
> mailbox to stall briefly. Both fixes are needed to fully address the issue.
> h2. Reproduction
> # Start a TaskManager with a job that creates many threads (e.g. a
> high-parallelism job with RocksDB state backend and async I/O operators).
> # In the Web UI, navigate to the TaskManager → Thread Dump tab.
> # Observe in the JM log: {{Heartbeat of TaskManager with id ... timed out}}
> within ~50s; the job enters failover.
> h2. Proposed fix
> *Step 1 (this ticket): purely additive changes, no default-behavior change.*
> * Offload the dump computation off the RPC main thread, using
> {{ioExecutor}} (consistent with {{requestLogList}} et al.). Apply
> single-flight (cache the in-flight future) so repeated UI clicks do
> not queue multiple dumps.
> * Introduce {{ThreadDumpMode {FULL, SAFE}}}:
> ** {{FULL}}: {{dumpAllThreads(true, true)}}, current behavior; retains
> locked-monitor / synchronizer info, useful for deadlock analysis.
> ** {{SAFE}}: {{dumpAllThreads(false, false)}}, skips monitor /
> synchronizer collection; safepoint is dramatically shorter on busy JVMs.
> * Surface the mode through:
> ** REST query parameter: {{GET /taskmanagers/\{id}/thread-dump?mode=safe|full}}
> (and the analogous JM endpoint).
> ** Cluster config {{cluster.thread-dump.default.mode}} (default {{FULL}}, same
> as today; this ticket does not change observable defaults).
> ** Web UI: radio selector ({{Safe}} / {{Full}}) on the Thread Dump tab, with a
> popconfirm on {{Full}}.
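> The offload, single-flight and mode ideas in the list above could be
> combined roughly as follows. This is a sketch under stated assumptions, not
> the proposed patch: {{SingleFlightThreadDump}} is a hypothetical class,
> keying the in-flight future by mode is an assumption the ticket does not
> specify, and the dump uses the JDK's {{ThreadMXBean}} directly rather than
> Flink's {{ThreadDumpInfo}}.
> {code:java}
> import java.lang.management.ManagementFactory;
> import java.lang.management.ThreadInfo;
> import java.lang.management.ThreadMXBean;
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.Executor;
>
> /** Mode names follow the ticket; the enum itself is only sketched here. */
> enum ThreadDumpMode { FULL, SAFE }
>
> /** Hypothetical helper: runs dumps off the caller thread, one per mode at a time. */
> class SingleFlightThreadDump {
>     private final Executor ioExecutor;
>     private final ConcurrentHashMap<ThreadDumpMode, CompletableFuture<ThreadInfo[]>> inFlight =
>             new ConcurrentHashMap<>();
>
>     SingleFlightThreadDump(Executor ioExecutor) {
>         this.ioExecutor = ioExecutor;
>     }
>
>     CompletableFuture<ThreadInfo[]> requestThreadDump(ThreadDumpMode mode) {
>         CompletableFuture<ThreadInfo[]> fresh = new CompletableFuture<>();
>         CompletableFuture<ThreadInfo[]> existing = inFlight.putIfAbsent(mode, fresh);
>         if (existing != null) {
>             return existing; // join the dump that is already in flight
>         }
>         try {
>             ioExecutor.execute(() -> {
>                 try {
>                     fresh.complete(dump(mode));
>                 } catch (Throwable t) {
>                     fresh.completeExceptionally(t);
>                 } finally {
>                     inFlight.remove(mode, fresh); // allow the next request to start a new dump
>                 }
>             });
>         } catch (RuntimeException e) { // e.g. the executor rejected the task
>             inFlight.remove(mode, fresh);
>             fresh.completeExceptionally(e);
>         }
>         return fresh;
>     }
>
>     /** FULL collects lock details where the JVM supports it; SAFE skips them. */
>     static ThreadInfo[] dump(ThreadDumpMode mode) {
>         ThreadMXBean bean = ManagementFactory.getThreadMXBean();
>         boolean full = mode == ThreadDumpMode.FULL;
>         return bean.dumpAllThreads(
>                 full && bean.isObjectMonitorUsageSupported(),
>                 full && bean.isSynchronizerUsageSupported());
>     }
> }
> {code}
> Inside an RPC endpoint whose handlers all run on one main thread, a plain
> field might suffice in place of the concurrent map; that choice is left to
> the actual patch.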
> *Step 2 (separate [DISCUSS] on dev@): consider flipping the default to {{SAFE}}.*
> Splitting the default change out keeps Step 1 strictly additive and easy
> to review/merge; the default flip can be argued with production data on
> its own merits.