Stephan Ewen commented on FLINK-4715:

I think we should do the following:

Split the cancellation up into two threads:
  # The first thread calls {{cancel()}} on the task and {{interrupt()}} on the 
main thread. It then exits.
  # The second thread is a watchdog that kicks in after {{n}} seconds (default 
is 10, I think) and then calls {{interrupt()}} again every {{n}} seconds. 
After a maximum duration (let's say 1 minute) it notifies the {{TaskManager}} of 
a fatal error. In most setups, this leads to a process kill.

The reason to separate this into two threads is that we have seen cases where 
{{cancel()}} blocks waiting on a lock held by the main thread. If everything 
ran in a single thread, a blocked {{cancel()}} would mean that the 
{{interrupt()}} call is never made and the "task manager exit" safety net 
never kicks in.
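
To make the proposal concrete, here is a rough sketch of the two threads in plain Java. It is only an illustration under stated assumptions, not the actual Flink code: {{CancelableTask}}, {{FatalErrorHandler}}, {{CancellationSketch}} and {{cancelTask()}} are hypothetical names, and the interval and timeout are passed in as parameters rather than read from any configuration.

```java
// Hypothetical sketch; none of these types are Flink's real classes.
final class CancellationSketch {

    /** Stand-in for the task being cancelled. */
    interface CancelableTask {
        void cancel() throws Exception;
    }

    /** Stand-in for the "notify the TaskManager of a fatal error" hook. */
    interface FatalErrorHandler {
        void onFatalError(String message);
    }

    static void cancelTask(
            CancelableTask task,
            Thread executer,
            long intervalMillis,
            long timeoutMillis,
            FatalErrorHandler handler) {

        if (intervalMillis <= 0) {
            // Thread.join(0) would wait forever, so reject it up front.
            throw new IllegalArgumentException("intervalMillis must be positive");
        }

        // Thread 1: call cancel(), interrupt the main thread once, then exit.
        // If cancel() blocks on a lock, only this thread is stuck.
        Thread canceler = new Thread(() -> {
            try {
                task.cancel();
            } catch (Throwable t) {
                // A real implementation would log this; the sketch ignores it.
            }
            executer.interrupt();
        }, "task-canceler");
        canceler.setDaemon(true);
        canceler.start();

        // Thread 2: watchdog, independent of whether cancel() ever returns.
        Thread watchdog = new Thread(() -> {
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
            try {
                // Give the task one interval to terminate on its own.
                executer.join(intervalMillis);
                while (executer.isAlive()) {
                    if (System.nanoTime() - deadline >= 0) {
                        handler.onFatalError(
                            "Task did not terminate within " + timeoutMillis + " ms");
                        return;
                    }
                    executer.interrupt();
                    executer.join(intervalMillis);
                }
            } catch (InterruptedException e) {
                // The watchdog itself was interrupted; stop watching.
            }
        }, "cancellation-watchdog");
        watchdog.setDaemon(true);
        watchdog.start();
    }
}
```

With this split, a {{cancel()}} that hangs only stalls the first thread. The watchdog still interrupts the main thread every interval and still escalates once the timeout has passed.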

> TaskManager should commit suicide after cancellation failure
> ------------------------------------------------------------
>                 Key: FLINK-4715
>                 URL: https://issues.apache.org/jira/browse/FLINK-4715
>             Project: Flink
>          Issue Type: Improvement
>          Components: TaskManager
>    Affects Versions: 1.2.0
>            Reporter: Till Rohrmann
>             Fix For: 1.2.0
>
> In case of a failed cancellation, e.g. the task cannot be cancelled after a 
> given time, the {{TaskManager}} should kill itself. That way we guarantee 
> that there is no resource leak. 
> This behaviour acts as a safety-net against faulty user code.
