[ 
https://issues.apache.org/jira/browse/FLINK-39984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xingsuo-zbz updated FLINK-39984:
--------------------------------
    Component/s: Runtime / Web Frontend
    Description: 
  h2. Summary

  Clicking "Thread Dump" on a JobManager or TaskManager in the Flink Web UI can
  cause the targeted process to miss heartbeats and be killed as failed, taking
  down the running job. The diagnostic feature itself triggers the failure.

  Observed in production with errors such as:
  {quote}
  Heartbeat of TaskManager with id <tm-id> timed out.
  java.util.concurrent.TimeoutException: Heartbeat of TaskManager ... timed out.
  {quote}

  h2. Root cause

  Two _independent_ issues compound:

  *1. The RPC handler runs synchronously on the main actor thread.*

  {{TaskExecutor#requestThreadDump}} (flink-runtime/.../taskexecutor/TaskExecutor.java:1463)
  and {{Dispatcher#requestThreadDump}} (flink-runtime/.../dispatcher/Dispatcher.java:1858)
  both return:

  {code:java}
  return CompletableFuture.completedFuture(ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth));
  {code}

  While the dump is being constructed, the actor mailbox does not advance, so
  heartbeat replies, task lifecycle messages, and checkpoint coordination
  messages all queue up behind it. Other heavy handlers in {{TaskExecutor}}
  (e.g. {{requestLogList}}, {{requestFileUploadByFilePath}}, {{updatePartitions}})
  are already offloaded to {{ioExecutor}} or the scheduled executor;
  {{requestThreadDump}} was not.
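
  The difference between the two handler shapes can be sketched outside Flink.
  This is a minimal illustration, not the actual patch: {{OffloadedThreadDump}},
  {{requestBlocking}} and {{requestOffloaded}} are hypothetical names, the
  executor passed in merely stands in for {{ioExecutor}}, and the dump uses
  {{dumpAllThreads(false, false)}} only to keep the sketch cheap to run.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for an RPC endpoint; not Flink's actual class.
final class OffloadedThreadDump {

    // Current shape: the dump is computed on the calling thread, and only
    // then is an already-completed future handed back.
    static CompletableFuture<ThreadInfo[]> requestBlocking() {
        return CompletableFuture.completedFuture(dump());
    }

    // Offloaded shape: the calling thread only schedules the work and
    // returns immediately; the dump runs on the supplied executor.
    static CompletableFuture<ThreadInfo[]> requestOffloaded(Executor ioExecutor) {
        return CompletableFuture.supplyAsync(OffloadedThreadDump::dump, ioExecutor);
    }

    private static ThreadInfo[] dump() {
        // Assumption for the sketch: no lock details are collected here.
        return ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        try {
            ThreadInfo[] infos = requestOffloaded(ioExecutor).get();
            System.out.println("threads captured: " + infos.length);
        } finally {
            ioExecutor.shutdown();
        }
    }
}
```

  In the offloaded shape the caller is free to handle the next mailbox message
  while the dump is still running, which is the property the other offloaded
  handlers already have.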

  *2. {{dumpAllThreads(true, true)}} triggers a long JVM-wide safepoint.*

  {{JvmUtils#createThreadDump}} (flink-runtime/.../util/JvmUtils.java:50) calls:

  {code:java}
  threadMxBean.dumpAllThreads(true, true);  // lockedMonitors + lockedSynchronizers
  {code}

  Collecting locked monitors and AQS synchronizers requires walking every
  thread's lock state inside a single safepoint. On busy JVMs (Netty + RocksDB
  + async I/O + user threads — easily 10k+ threads in production), this can
  take many seconds to tens of seconds. During the safepoint, _every_ thread
  in the JVM is paused, including the heartbeat dispatcher itself.
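
  A rough way to compare the two variants on a given JVM is to time the call
  itself. This is only a sketch under assumptions: {{DumpCostProbe}} and
  {{timeDumpNanos}} are hypothetical names, and wall-clock time around the call
  is merely a proxy for the safepoint pause, so the numbers are indicative at
  best. On a small JVM the two timings may be indistinguishable.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical probe; not part of Flink.
final class DumpCostProbe {

    // Returns the wall-clock nanoseconds spent inside one dumpAllThreads call.
    static long timeDumpNanos(boolean lockedMonitors, boolean lockedSynchronizers) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Asking for lock details on a JVM that cannot report them throws
        // UnsupportedOperationException, so downgrade the request instead.
        boolean monitors = lockedMonitors && bean.isObjectMonitorUsageSupported();
        boolean synchronizers = lockedSynchronizers && bean.isSynchronizerUsageSupported();
        long start = System.nanoTime();
        bean.dumpAllThreads(monitors, synchronizers);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        System.out.printf("live threads:      %d%n", threads);
        System.out.printf("with lock info:    %d ms%n", timeDumpNanos(true, true) / 1_000_000);
        System.out.printf("without lock info: %d ms%n", timeDumpNanos(false, false) / 1_000_000);
    }
}
```

  Running this inside a process with a production-like thread count would give
  a first estimate of how close a single dump comes to the heartbeat budget.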

  If the safepoint duration plus mailbox queueing exceeds {{heartbeat.timeout}}
  (default 50s), the JM marks the TM dead and triggers a failover — even
  though the TM is functional.

  Note: fixing only (1) helps short dumps but not long ones, because the
  safepoint pauses the heartbeat thread regardless of which executor the
  caller runs on. Fixing only (2) helps long dumps but still allows the
  mailbox to stall briefly. Both fixes are needed to fully address the issue.

  h2. Reproduction

  # Start a TaskManager with a job that creates many threads (e.g. a
    high-parallelism job with RocksDB state backend and async I/O operators).
  # In the Web UI, navigate to the TaskManager → Thread Dump tab.
  # Observe in the JM log: {{Heartbeat of TaskManager with id ... timed out}}
    within ~50s; the job enters failover.

  h2. Proposed fix

  *Step 1 (this ticket): purely additive changes, no default-behavior change.*

  * Offload the dump computation off the RPC main thread, using
    {{ioExecutor}} (consistent with {{requestLogList}} et al.). Apply
    single-flight (cache the in-flight future) so repeated UI clicks do
    not queue multiple dumps.
  * Introduce {{ThreadDumpMode \{FULL, SAFE\}}}:
  ** {{FULL}} — {{dumpAllThreads(true, true)}}, current behavior, retains
     locked-monitor / synchronizer info, useful for deadlock analysis.
  ** {{SAFE}} — {{dumpAllThreads(false, false)}}, skips monitor /
     synchronizer collection; safepoint is dramatically shorter on busy JVMs.
  * Surface the mode through:
  ** REST query parameter: {{GET /taskmanagers/\{id\}/thread-dump?mode=safe|full}}
     (and the analogous JM endpoint).
  ** Cluster config {{cluster.thread-dump.mode}} (default {{FULL}} — same as
     today; this ticket does not change observable defaults).
  ** Web UI: radio selector ({{Safe}} / {{Full}}) on the Thread Dump tab,
     with a popconfirm on {{Full}}.
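
  The single-flight and mode ideas from the list above can be sketched
  together as follows. Everything here is illustrative: the class
  {{SingleFlightThreadDump}}, the {{parse}} helper and the choice to keep one
  in-flight future per mode (so a {{SAFE}} request never receives a {{FULL}}
  result, or vice versa) are assumptions made for the sketch, not decisions
  taken in this ticket.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Hypothetical sketch; class and method names are not taken from Flink.
final class SingleFlightThreadDump {

    enum ThreadDumpMode {
        FULL, // dumpAllThreads(true, true)
        SAFE; // dumpAllThreads(false, false)

        // Maps a query-parameter value such as "safe" to a mode. Missing
        // values yield the fallback; unknown values raise
        // IllegalArgumentException via valueOf.
        static ThreadDumpMode parse(String raw, ThreadDumpMode fallback) {
            if (raw == null || raw.trim().isEmpty()) {
                return fallback;
            }
            return valueOf(raw.trim().toUpperCase(Locale.ROOT));
        }
    }

    private final Executor ioExecutor;
    // One in-flight future per mode; guarded by synchronizing on the map.
    private final Map<ThreadDumpMode, CompletableFuture<ThreadInfo[]>> inFlight =
            new EnumMap<>(ThreadDumpMode.class);

    SingleFlightThreadDump(Executor ioExecutor) {
        this.ioExecutor = ioExecutor;
    }

    CompletableFuture<ThreadInfo[]> request(ThreadDumpMode mode) {
        CompletableFuture<ThreadInfo[]> result;
        synchronized (inFlight) {
            CompletableFuture<ThreadInfo[]> existing = inFlight.get(mode);
            if (existing != null) {
                return existing; // join the dump that is already running
            }
            result = new CompletableFuture<>();
            inFlight.put(mode, result);
        }
        try {
            ioExecutor.execute(() -> runDump(mode, result));
        } catch (RuntimeException rejected) {
            finish(mode, result, null, rejected); // e.g. executor shut down
        }
        return result;
    }

    private void runDump(ThreadDumpMode mode, CompletableFuture<ThreadInfo[]> result) {
        try {
            boolean withLocks = mode == ThreadDumpMode.FULL;
            ThreadInfo[] infos =
                    ManagementFactory.getThreadMXBean().dumpAllThreads(withLocks, withLocks);
            finish(mode, result, infos, null);
        } catch (Throwable t) {
            finish(mode, result, null, t);
        }
    }

    private void finish(
            ThreadDumpMode mode,
            CompletableFuture<ThreadInfo[]> result,
            ThreadInfo[] infos,
            Throwable failure) {
        // Clear the slot before completing so later requests start a fresh dump.
        synchronized (inFlight) {
            inFlight.remove(mode, result);
        }
        if (failure != null) {
            result.completeExceptionally(failure);
        } else {
            result.complete(infos);
        }
    }
}
```

  With this shape, repeated clicks while a dump is running all receive the same
  future, and the next request after completion triggers a new dump rather than
  replaying a stale one.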

  *Step 2 (separate \[DISCUSS\] on dev@): consider flipping the default to {{SAFE}}.*

  Splitting the default change out keeps Step 1 strictly additive and easy
  to review/merge; the default flip can be argued with production data on
  its own merits.


  was:
 
Both `Dispatcher#requestThreadDump` (JobManager) and 
`TaskExecutor#requestThreadDump` (TaskManager) currently execute 
`ThreadDumpInfo.dumpAndCreate(...)` synchronously on the RPC actor main thread:

{code:java}
return CompletableFuture.completedFuture(ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth));
{code}

`dumpAndCreate` ends up calling `threadMxBean.dumpAllThreads(true, true)`, 
which on a JVM with many threads (Netty, RocksDB, async I/O, user threads — 
easily ~ 10k in production) can take several seconds to tens of seconds, 
especially when collecting monitor and synchronizer info.

While this call is in progress, the RPC actor cannot process any other message, 
including:
 * heartbeat pings from the JobManager / ResourceManager,
 * task lifecycle messages,
 * checkpoint trigger / confirm / abort messages.

 

If the dump takes longer than `heartbeat.timeout` (default 50s), the JM 
declares the TM dead and triggers a failover, even though the TM itself is 
fully functional — the heartbeat thread was simply queued behind the dump 
request in the actor mailbox.

 

We have observed this in production:

{code:java}
Heartbeat of TaskManager with id <tm-id> timed out.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id ... timed out.
{code}

This is essentially a self-inflicted failure caused by a diagnostic tool — 
clicking "Thread Dump" in the Web UI of a large-state job can kill the job.

Other "potentially expensive" RPC handlers in `TaskExecutor` (e.g. 
`requestLogList`, `requestFileUploadByFilePath`, `updatePartitions`) are already 
dispatched onto `ioExecutor` / the scheduled executor. `requestThreadDump` 
should follow the same pattern. 


> Thread dump RPC can cause heartbeat timeout and unnecessary JM/TM failover
> --------------------------------------------------------------------------
>
>                 Key: FLINK-39984
>                 URL: https://issues.apache.org/jira/browse/FLINK-39984
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / REST, Runtime / Web Frontend
>    Affects Versions: 1.17.2, 2.3.0, 1.20.5
>            Reporter:  xingsuo-zbz
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
