[ 
https://issues.apache.org/jira/browse/FLINK-39984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xingsuo-zbz updated FLINK-39984:
--------------------------------
    Description: 
 
Both `Dispatcher#requestThreadDump` (JobManager) and `TaskExecutor#requestThreadDump` (TaskManager)
currently execute `ThreadDumpInfo.dumpAndCreate(...)` synchronously on the RPC actor main thread:

{code:java}
return CompletableFuture.completedFuture(ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth));
{code}

`dumpAndCreate` ends up calling `threadMxBean.dumpAllThreads(true, true)`. On a JVM with
many threads (Netty, RocksDB, async I/O, user threads; easily around 10k in production),
this call can take several seconds to tens of seconds, especially when it also collects
monitor and synchronizer info.
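
To get a feel for the cost on a given JVM, the dump can be timed in isolation. The following is a minimal sketch, not Flink code: the class name `DumpTiming` and the helper `dumpAll` are made up for illustration, and it requests monitor and synchronizer info only when the JVM reports support for them.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DumpTiming {
    /** Takes a full thread dump, including monitor and synchronizer info where supported. */
    static ThreadInfo[] dumpAll() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Equivalent to dumpAllThreads(true, true) on JVMs that support both features.
        return bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        ThreadInfo[] infos = dumpAll();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Dumped " + infos.length + " threads in " + elapsedMs + " ms");
    }
}
```

Running this inside a process with a thread count comparable to production should give a rough idea of how long the RPC main thread would be occupied.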

While this call is in progress, the RPC actor cannot process any other message, including:
 * heartbeat pings from the JobManager / ResourceManager,
 * task lifecycle messages,
 * checkpoint trigger / confirm / abort messages.

If the dump takes longer than `heartbeat.timeout` (default 50s), the JM declares the TM dead
and triggers a failover, even though the TM itself is fully functional: the heartbeat message
was simply queued behind the dump request in the actor mailbox.
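
The queuing effect can be reproduced without Flink. The sketch below models the actor mailbox as a single-threaded executor, which is a simplification rather than the actual RPC implementation; `MailboxBlocking` and `heartbeatDelayMillis` are hypothetical names, and `Thread.sleep` stands in for the slow dump.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MailboxBlocking {
    /** Returns how long (in ms) a heartbeat waits when queued behind a slow handler. */
    static long heartbeatDelayMillis(long slowHandlerMillis) {
        // One thread processes all messages in order, like an actor main thread.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            // The "thread dump" handler occupies the main thread.
            mainThread.submit(() -> {
                try {
                    Thread.sleep(slowHandlerMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // The heartbeat is enqueued right away but cannot run until the handler returns.
            long enqueued = System.nanoTime();
            Future<Long> heartbeat = mainThread.submit(() -> System.nanoTime() - enqueued);
            return heartbeat.get() / 1_000_000;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            mainThread.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("Heartbeat waited " + heartbeatDelayMillis(500) + " ms");
    }
}
```

The heartbeat handler itself is trivial here; its delay comes entirely from the handler ahead of it in the queue.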

We have observed this in production:

{code:java}
Heartbeat of TaskManager with id <tm-id> timed out.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id ... timed out.
{code}

This is essentially a self-inflicted failure caused by a diagnostic tool: clicking
"Thread Dump" in the Web UI of a large-state job can kill the job.

Other "potentially expensive" RPC handlers in `TaskExecutor` (e.g. `requestLogList`,
`requestFileUploadByFilePath`, `updatePartitions`) are already dispatched onto `ioExecutor` /
the scheduled executor. `requestThreadDump` should follow the same pattern.
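
A possible shape of the change, shown as a standalone sketch rather than the actual patch: the dump is handed to an I/O executor via `CompletableFuture.supplyAsync`, so the calling thread returns immediately. `ThreadDumpOffload`, `DumpResult`, and `dump()` are hypothetical stand-ins for the Flink classes, and the dump here skips monitor and synchronizer info to stay cheap.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadDumpOffload {
    /** Hypothetical result type: which thread ran the dump and how many threads it saw. */
    static final class DumpResult {
        final String executedOn;
        final int threadCount;

        DumpResult(String executedOn, int threadCount) {
            this.executedOn = executedOn;
            this.threadCount = threadCount;
        }
    }

    /** Stand-in for ThreadDumpInfo.dumpAndCreate(...); not the Flink implementation. */
    static DumpResult dump() {
        int count = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false).length;
        return new DumpResult(Thread.currentThread().getName(), count);
    }

    /** Current pattern: the dump runs on the calling (main) thread before the future exists. */
    static CompletableFuture<DumpResult> requestThreadDumpBlocking() {
        return CompletableFuture.completedFuture(dump());
    }

    /** Proposed pattern: the dump runs on the given executor; the caller is not blocked. */
    static CompletableFuture<DumpResult> requestThreadDumpAsync(Executor ioExecutor) {
        return CompletableFuture.supplyAsync(ThreadDumpOffload::dump, ioExecutor);
    }

    public static void main(String[] args) {
        ExecutorService ioExecutor =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "io-executor"));
        try {
            DumpResult result = requestThreadDumpAsync(ioExecutor).join();
            System.out.println(
                    "Dumped " + result.threadCount + " threads on " + result.executedOn);
        } finally {
            ioExecutor.shutdown();
        }
    }
}
```

In Flink itself the equivalent would presumably wrap `ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth)` in the same way, using the component's existing `ioExecutor`.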


> `requestThreadDump` blocks the JM/TM main thread and can cause heartbeat 
> timeout / job failure
> ----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39984
>                 URL: https://issues.apache.org/jira/browse/FLINK-39984
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / REST
>    Affects Versions: 1.17.2, 2.3.0, 1.20.5
>            Reporter:  xingsuo-zbz
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
