[jira] [Commented] (FLINK-20992) Checkpoint cleanup can kill JobMaster

Roman Khachatryan (Jira) Mon, 18 Jan 2021 03:40:06 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-20992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267188#comment-17267188
 ]


Roman Khachatryan commented on FLINK-20992:
-------------------------------------------

I've published a simple PR to address the issue directly.

 

However, this is not the first time we have hit this RejectedExecutionException 
problem (e.g. FLINK-18290).

I think the reason is that the executors used by the coordinator aren't aware of 
its lifecycle.

So I propose to:
 # Create the executors inside CheckpointCoordinator (both the io and timer thread pools)
 # Check isShutdown() in their error handlers (if it returns true and the error is a 
RejectedExecutionException, then just log it; otherwise delegate to 
FatalExitExceptionHandler)
 # (this will allow us to remove such RejectedExecutionException checks from the 
coordinator code)
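
To make step 2 concrete, here is a minimal sketch of such an error handler. It is not the actual Flink code: ShutdownAwareHandler and its fields are hypothetical names, the `fatal` delegate merely stands in for FatalExitExceptionHandler, and the counters exist only so the demo can report what happened.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

final class ShutdownAwareHandlerDemo {

    /** Hypothetical handler: logs rejections seen after shutdown, escalates everything else. */
    static final class ShutdownAwareHandler implements Thread.UncaughtExceptionHandler {
        private final BooleanSupplier isShutdown;            // e.g. the owner's isShutdown()
        private final Thread.UncaughtExceptionHandler fatal; // stand-in for FatalExitExceptionHandler
        final AtomicInteger ignored = new AtomicInteger();   // demo-only counter

        ShutdownAwareHandler(BooleanSupplier isShutdown, Thread.UncaughtExceptionHandler fatal) {
            this.isShutdown = isShutdown;
            this.fatal = fatal;
        }

        @Override
        public void uncaughtException(Thread t, Throwable e) {
            if (e instanceof RejectedExecutionException && isShutdown.getAsBoolean()) {
                // Expected once the owner has shut down: log and move on.
                ignored.incrementAndGet();
                System.err.println("Ignoring task rejected after shutdown: " + e);
            } else {
                // Anything else is still treated as fatal.
                fatal.uncaughtException(t, e);
            }
        }

        /** Creates a pool whose worker threads report uncaught errors to this handler. */
        ExecutorService newSingleThreadExecutor(String threadName) {
            return Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, threadName);
                t.setUncaughtExceptionHandler(this);
                return t;
            });
        }
    }

    /** Feeds the handler three errors and returns {ignored, escalated}. */
    static int[] run() {
        AtomicBoolean shutdown = new AtomicBoolean(false);
        AtomicInteger escalated = new AtomicInteger();
        ShutdownAwareHandler handler =
                new ShutdownAwareHandler(shutdown::get, (t, e) -> escalated.incrementAndGet());
        Thread self = Thread.currentThread();

        handler.uncaughtException(self, new RejectedExecutionException("before shutdown")); // escalated
        shutdown.set(true);
        handler.uncaughtException(self, new RejectedExecutionException("after shutdown"));  // logged only
        handler.uncaughtException(self, new IllegalStateException("unrelated"));            // escalated
        return new int[] {handler.ignored.get(), escalated.get()};
    }

    public static void main(String[] args) {
        int[] result = run();
        System.out.println("ignored=" + result[0] + ", escalated=" + result[1]);
    }
}
```

The handler only helps for tasks whose exceptions actually reach the thread's uncaught-exception handler, i.e. tasks passed to execute() on a plain thread pool; exceptions from tasks wrapped in a Future are captured by that Future instead.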

 

Additionally, I found that during shutdown we don't wait for checkpoint cleanup 
(or any other tasks submitted to the executors) to complete:
{code:java}
checkpointCoordinatorTimer.shutdownNow(); // in ExecutionGraph
scheduledExecutorService.shutdownNow(); // in JobManagerSharedServices
{code}
So only currently executing actions will complete, but not any queued ones.
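
The effect on queued work can be reproduced with a plain JDK executor. The following is only an illustration under assumed names (ShutdownNowDemo and the printed "cleanup" tasks are hypothetical, not Flink code): one task occupies the single worker thread while two more wait in the queue, and shutdownNow() hands those two back without ever running them.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ShutdownNowDemo {

    /** Returns the number of queued tasks that shutdownNow() handed back without running. */
    static int droppedTasks() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1); // never counted down; the task ends on interrupt

        // Occupies the only worker thread until shutdownNow() interrupts it.
        executor.execute(() -> {
            started.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // These two stay in the queue behind the blocked task.
        executor.execute(() -> System.out.println("cleanup 1"));
        executor.execute(() -> System.out.println("cleanup 2"));

        try {
            started.await(); // make sure the first task is really running
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        List<Runnable> neverRun = executor.shutdownNow();
        return neverRun.size();
    }

    public static void main(String[] args) {
        System.out.println("tasks dropped by shutdownNow(): " + droppedTasks());
    }
}
```

Note that shutdownNow() also interrupts the running task, so even an in-flight action only completes if it tolerates the interrupt.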

I think we SHOULD complete cleanup on shutdown, and I propose the following:
 # Replace shutdownNow with shutdown to allow cleanup to finish
 # Add awaitTermination (with a timeout)
 # At least log the result of shutdownNow (the list of runnables that never ran)
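
The three steps could be combined roughly as follows. This is a sketch with hypothetical names (GracefulShutdownDemo, shutdownGracefully) and an arbitrary timeout, not the code of the eventual fix; it falls back to shutdownNow() only if the queued work does not finish in time, and logs whatever was abandoned.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class GracefulShutdownDemo {

    /** Lets queued tasks finish; returns the tasks abandoned if the timeout expires. */
    static List<Runnable> shutdownGracefully(ExecutorService executor, long timeout, TimeUnit unit) {
        executor.shutdown(); // stop accepting new tasks, keep running the queued ones
        try {
            if (executor.awaitTermination(timeout, unit)) {
                return Collections.emptyList(); // everything completed in time
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and fall through
        }
        List<Runnable> abandoned = executor.shutdownNow();
        System.err.println("Executor did not terminate in time; abandoned tasks: " + abandoned);
        return abandoned;
    }

    /** Queues three tasks, shuts down gracefully, and returns how many ran (-1 on timeout). */
    static int completedAfterGracefulShutdown() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < 3; i++) {
            executor.execute(completed::incrementAndGet);
        }
        List<Runnable> abandoned = shutdownGracefully(executor, 10, TimeUnit.SECONDS);
        return abandoned.isEmpty() ? completed.get() : -1;
    }

    public static void main(String[] args) {
        System.out.println("tasks completed: " + completedAfterGracefulShutdown());
    }
}
```

The timeout bounds how long shutdown can block on slow cleanup, which is the trade-off against the current behaviour of dropping queued work immediately.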

 

 WDYT [~trohrmann]?

I'd create separate tickets for the latter two issues.

> Checkpoint cleanup can kill JobMaster
> -------------------------------------
>
>                 Key: FLINK-20992
>                 URL: https://issues.apache.org/jira/browse/FLINK-20992
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.13.0, 1.12.2
>
>
> A user reported that cancelling a job can lead to an uncaught exception which 
> kills the {{JobMaster}}. The problem seems to be that the 
> {{CheckpointsCleaner}} might trigger {{CheckpointCoordinator}} actions after 
> the job has reached a terminal state and, thus, is shut down. Apparently, we 
> do not properly manage the lifecycles of {{CheckpointCoordinator}} and 
> checkpoint post clean up actions.
> The uncaught exception is 
> {code}
> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@41554407 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@5d0ec6f7[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 25977]
>  at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
>  at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
>  at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326)
>  at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
>  at java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:622)
>  at java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668)
>  at org.apache.flink.runtime.concurrent.ScheduledExecutorServiceAdapter.execute(ScheduledExecutorServiceAdapter.java:62)
>  at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.scheduleTriggerRequest(CheckpointCoordinator.java:1152)
>  at org.apache.flink.runtime.checkpoint.CheckpointsCleaner.lambda$cleanCheckpoint$0(CheckpointsCleaner.java:58)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
> cc [~roman_khachatryan].
> https://lists.apache.org/thread.html/r75901008d88163560aabb8ab6fc458cd9d4f19076e03ae85e00f9a3a%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
