[ 
https://issues.apache.org/jira/browse/FLINK-20886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-20886:
--------------------------------
    Description: 
For debugging checkpoint timeouts, I was thinking about the following addition 
to Flink:

When a checkpoint times out and the async thread is still running, create a 
thread dump [1] and either add this to the checkpoint stats, log it, or write 
it out.

This may help identify where the checkpoint is stuck (maybe on a lock, possibly 
inside a third-party library such as the FS connectors). It would give us some 
insight into what the thread is currently doing.
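
As a rough illustration of what such a dump could contain, here is a minimal 
sketch that uses only the JDK's ThreadMXBean. It is not existing Flink code: 
the class and method names are made up for this example, and where the result 
would end up (checkpoint stats, log, or file) is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Flink.
final class ThreadDumpSketch {

    /** Returns a textual dump of all live threads with full stack traces. */
    static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Only request lock details the JVM can actually provide.
        ThreadInfo[] infos = bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());

        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // defensive: skip entries without thread info
            }
            sb.append('"').append(info.getThreadName()).append("\" ")
                    .append(info.getThreadState());
            if (info.getLockName() != null) {
                sb.append(" on ").append(info.getLockName());
            }
            if (info.getLockOwnerName() != null) {
                sb.append(" owned by \"").append(info.getLockOwnerName()).append('"');
            }
            sb.append('\n');
            // ThreadInfo.toString() truncates the stack trace, so print every frame.
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

In this sketch the caller would invoke dumpAllThreads() from whatever code 
path detects the checkpoint timeout.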

Limiting the scope of the threads would be nice but may not be possible in the 
general case since additional threads (spawned by the FS connector lib, or 
otherwise connected) may interact with the async thread(s), e.g. by going 
through the same locks. Maybe we can reduce the thread dump to all async 
threads of the failed checkpoint plus all threads that interact with them, e.g. 
via locks?
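
One way the lock-based reduction might be approximated is sketched below. It 
assumes the IDs of the checkpoint's async threads are already known (the 
seedIds parameter is hypothetical) and then follows the "blocked on a lock 
owned by" relation reported by ThreadInfo in both directions. This only sees 
threads that are blocked at the moment of the snapshot, so it can miss threads 
that interact in other ways.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Flink.
final class LockRelatedThreads {

    /**
     * Selects the seed threads plus every thread connected to them through
     * lock ownership within the given snapshot.
     */
    static List<ThreadInfo> select(ThreadInfo[] snapshot, Set<Long> seedIds) {
        Set<Long> selected = new HashSet<>(seedIds);
        boolean changed = true;
        while (changed) { // repeat until no further thread is added
            changed = false;
            for (ThreadInfo info : snapshot) {
                if (info == null) {
                    continue;
                }
                long id = info.getThreadId();
                long owner = info.getLockOwnerId(); // -1 if not blocked on an owned lock
                if (owner == -1) {
                    continue;
                }
                // A selected thread is blocked by 'owner': include the owner.
                if (selected.contains(id) && selected.add(owner)) {
                    changed = true;
                }
                // This thread is blocked by a selected thread: include it.
                if (selected.contains(owner) && selected.add(id)) {
                    changed = true;
                }
            }
        }
        List<ThreadInfo> result = new ArrayList<>();
        for (ThreadInfo info : snapshot) {
            if (info != null && selected.contains(info.getThreadId())) {
                result.add(info);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ThreadInfo[] snapshot =
                ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        Set<Long> seeds = new HashSet<>();
        seeds.add(Thread.currentThread().getId()); // stand-in for the async thread(s)
        for (ThreadInfo info : select(snapshot, seeds)) {
            System.out.println(info.getThreadName());
        }
    }
}
```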

I'm also not sure whether thread dumps should be user-configurable (Could a 
dump contain sensitive information from other jobs if you run a session 
cluster? Is that even relevant since we don't give isolation guarantees 
anyway?). If it is configurable, it should be on by default.


[1] https://crunchify.com/how-to-generate-java-thread-dump-programmatically/

  was:
For debugging checkpoint timeouts, I was thinking about the following addition 
to Flink:

When a checkpoint times out and the async thread is still running, create a 
threaddump [1] and either add this to the checkpoint stats, log it, or write it 
out.

This may help identifying where the checkpoint is stuck (maybe a lock, could 
also be in a third party lib like the FS connectors,...). It would give us some 
insights into what the thread is currently doing.

Limiting the scope of the threads would be nice but may not be possible in the 
general case since additional threads (spawned by the FS connector lib, or 
otherwise connected) may interact with the async thread(s) by e.g. going 
through the same locks.


[1] https://crunchify.com/how-to-generate-java-thread-dump-programmatically/


> Add the option to get a threaddump on checkpoint timeouts
> ---------------------------------------------------------
>
>                 Key: FLINK-20886
>                 URL: https://issues.apache.org/jira/browse/FLINK-20886
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Nico Kruber
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
