[ 
https://issues.apache.org/jira/browse/FLINK-39984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xingsuo-zbz updated FLINK-39984:
--------------------------------
    Summary: Thread dump RPC can cause heartbeat timeout and unnecessary JM/TM 
failover  (was: `requestThreadDump` blocks the JM/TM main thread and can cause 
heartbeat timeout / job failure)

> Thread dump RPC can cause heartbeat timeout and unnecessary JM/TM failover
> --------------------------------------------------------------------------
>
>                 Key: FLINK-39984
>                 URL: https://issues.apache.org/jira/browse/FLINK-39984
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / REST
>    Affects Versions: 1.17.2, 2.3.0, 1.20.5
>            Reporter:  xingsuo-zbz
>            Priority: Critical
>
>  
> Both `Dispatcher#requestThreadDump` (JobManager) and 
> `TaskExecutor#requestThreadDump` (TaskManager) currently execute 
> `ThreadDumpInfo.dumpAndCreate(...)` synchronously on the RPC actor main 
> thread:
> {code:java}
> return CompletableFuture.completedFuture(ThreadDumpInfo.dumpAndCreate(stacktraceMaxDepth));
> {code}
> `dumpAndCreate` ends up calling `threadMxBean.dumpAllThreads(true, true)`, 
> which on a JVM with many threads (Netty, RocksDB, async I/O, user threads; 
> easily ~10k in production) can take several seconds to tens of seconds, 
> especially when collecting monitor and synchronizer info.
> While this call is in progress, the RPC actor cannot process any other 
> message, including:
>  * heartbeat pings from the JobManager / ResourceManager,
>  * task lifecycle messages,
>  * checkpoint trigger / confirm / abort messages.
>  
> If the dump takes longer than `heartbeat.timeout` (default 50s), the JM 
> declares the TM dead and triggers a failover, even though the TM itself is 
> fully functional: the heartbeat message was simply queued behind the dump 
> request in the actor mailbox.
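The queuing effect described above can be reproduced with the JDK alone. This is a minimal sketch, not Flink code: a single-threaded executor stands in for the RPC actor main thread, and `MailboxBlockingSketch`, `heartbeatDelayMillis`, and the 300 ms sleep are hypothetical names and values chosen for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class MailboxBlockingSketch {

    /** Returns how long (ms) a "heartbeat" waited behind a slow handler on the same thread. */
    static long heartbeatDelayMillis(long slowHandlerMillis) {
        // Assumption: one thread processes all messages in order, like an actor mailbox.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            long enqueuedAt = System.nanoTime();
            // The slow handler (standing in for the synchronous thread dump) goes first.
            mainThread.execute(() -> sleep(slowHandlerMillis));
            // The heartbeat is enqueued right away, but can only run after the slow handler.
            CompletableFuture<Long> heartbeat = CompletableFuture.supplyAsync(
                    () -> TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - enqueuedAt),
                    mainThread);
            return heartbeat.join();
        } finally {
            mainThread.shutdown();
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("heartbeat delayed by about " + heartbeatDelayMillis(300) + " ms");
    }
}
```

The heartbeat handler itself is trivial; its delay comes entirely from waiting behind the slow handler, which is the situation the timeout then misreads as a dead TaskManager.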
>  
> We have observed this in production:
> {code:java}
> Heartbeat of TaskManager with id <tm-id> timed out.
> java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id ... timed out.
> {code}
> This is essentially a self-inflicted failure caused by a diagnostic tool — 
> clicking "Thread Dump" in the Web UI of a large-state job can kill the job.
> Other "potentially expensive" RPC handlers in `TaskExecutor` (e.g. 
> `requestLogList`, `requestFileUploadByFilePath`, `updatePartitions`) are 
> already dispatched onto `ioExecutor` / the scheduled executor. 
> `requestThreadDump` should follow the same pattern.
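One way the proposed pattern might look, sketched with plain JDK classes only. This is not the actual Flink patch: `ThreadDumpOffloadSketch` and its methods are hypothetical, `dumpAllThreads()` stands in for `ThreadDumpInfo.dumpAndCreate(...)`, and the `ioExecutor` parameter is assumed to be any `Executor` separate from the RPC main thread.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ThreadDumpOffloadSketch {

    // Hypothetical stand-in for ThreadDumpInfo.dumpAndCreate(...): the expensive part.
    static ThreadInfo[] dumpAllThreads() {
        // Collects locked monitors and ownable synchronizers, as in the issue description.
        return ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
    }

    // Called on the "main thread": only schedules the dump and returns a future immediately,
    // so the caller stays free to process other messages while the dump runs elsewhere.
    static CompletableFuture<ThreadInfo[]> requestThreadDump(Executor ioExecutor) {
        return CompletableFuture.supplyAsync(ThreadDumpOffloadSketch::dumpAllThreads, ioExecutor);
    }

    public static void main(String[] args) {
        ExecutorService ioExecutor = Executors.newFixedThreadPool(2);
        try {
            ThreadInfo[] dump = requestThreadDump(ioExecutor).join();
            System.out.println("captured " + dump.length + " threads");
        } finally {
            ioExecutor.shutdown();
        }
    }
}
```

Compared with `CompletableFuture.completedFuture(...)`, which evaluates its argument on the calling thread before the future exists, `supplyAsync` defers the work to the supplied executor; the dump still costs the same, but it no longer delays heartbeat handling.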



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
