[ https://issues.apache.org/jira/browse/FLINK-20992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267188#comment-17267188 ]
Roman Khachatryan commented on FLINK-20992:
-------------------------------------------

I've published a simple PR to address the issue directly.

However, this is not the first time we have hit this RejectedExecutionException problem (e.g. FLINK-18290). I think the reason is that the executors used by the coordinator aren't aware of its lifecycle. So I propose to:
# Create executors inside CheckpointCoordinator (both io & timer thread pools)
# Check isShutdown() in their error handlers (if yes and it's a RejectedExecutionException then just log; otherwise delegate to FatalExitExceptionHandler)
# (this will allow us to remove such RejectedExecutionException checks from the coordinator code)

Additionally, I found that during shutdown we don't wait for checkpoint cleanup to complete (or for any other tasks submitted to the executors):
{code:java}
checkpointCoordinatorTimer.shutdownNow(); // in ExecutionGraph
scheduledExecutorService.shutdownNow(); // in JobManagerSharedServices
{code}
So only currently executing actions will complete, but not any queued ones.
I think we SHOULD complete cleanup on shutdown and propose the following:
# Replace shutdownNow with shutdown to allow cleanup to finish
# Add awaitTermination (with timeout)
# At least log the result of shutdownNow (list of runnables)

WDYT [~trohrmann]?

I'd create separate tickets for the latter two issues.

> Checkpoint cleanup can kill JobMaster
> -------------------------------------
>
>                 Key: FLINK-20992
>                 URL: https://issues.apache.org/jira/browse/FLINK-20992
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.13.0, 1.12.2
>
>
> A user reported that cancelling a job can lead to an uncaught exception which
> kills the {{JobMaster}}. The problem seems to be that the
> {{CheckpointsCleaner}} might trigger {{CheckpointCoordinator}} actions after
> the job has reached a terminal state and, thus, is shut down. Apparently, we
> do not properly manage the lifecycles of {{CheckpointCoordinator}} and
> checkpoint post clean up actions.
> The uncaught exception is
> {code}
> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@41554407 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@5d0ec6f7[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 25977]
>     at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
>     at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326)
>     at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
>     at java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:622)
>     at java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668)
>     at org.apache.flink.runtime.concurrent.ScheduledExecutorServiceAdapter.execute(ScheduledExecutorServiceAdapter.java:62)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.scheduleTriggerRequest(CheckpointCoordinator.java:1152)
>     at org.apache.flink.runtime.checkpoint.CheckpointsCleaner.lambda$cleanCheckpoint$0(CheckpointsCleaner.java:58)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> cc [~roman_khachatryan].
> https://lists.apache.org/thread.html/r75901008d88163560aabb8ab6fc458cd9d4f19076e03ae85e00f9a3a%40%3Cuser.flink.apache.org%3E
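
For illustration, the shutdown sequence proposed in the comment (shutdown, then awaitTermination with a timeout, then logging whatever shutdownNow returns) could look roughly like the sketch below. It uses only plain java.util.concurrent calls and is not Flink code: the class name GracefulShutdownSketch and the helpers shutdownGracefully and isExpectedRejection are hypothetical names made up for this example, and logging goes to System.err rather than to a real logger.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class, not part of Flink.
final class GracefulShutdownSketch {

    private GracefulShutdownSketch() {}

    /**
     * Stops accepting new tasks, waits up to the timeout for queued and running
     * tasks to finish, and only then falls back to shutdownNow().
     * Returns the tasks that never started (empty if everything completed).
     */
    static List<Runnable> shutdownGracefully(
            ExecutorService executor, long timeout, TimeUnit unit) {
        // Step 1: shutdown() lets already submitted tasks (e.g. cleanup) run.
        executor.shutdown();
        try {
            // Step 2: bounded wait so that shutdown cannot hang forever.
            if (executor.awaitTermination(timeout, unit)) {
                return Collections.emptyList();
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag and fall through to the forced stop.
            Thread.currentThread().interrupt();
        }
        // Step 3: forced stop; report what was dropped instead of losing it silently.
        List<Runnable> dropped = executor.shutdownNow();
        if (!dropped.isEmpty()) {
            System.err.println(
                    "Executor shutdown dropped " + dropped.size()
                            + " queued task(s): " + dropped);
        }
        return dropped;
    }

    /**
     * True if the failure is a rejection from an executor that is already shut
     * down, i.e. the case the comment suggests logging instead of treating as fatal.
     */
    static boolean isExpectedRejection(Throwable error, ExecutorService executor) {
        return error instanceof RejectedExecutionException && executor.isShutdown();
    }
}
```

A caller would pick the timeout to match how long cleanup may reasonably take, for example shutdownGracefully(executor, 10, TimeUnit.SECONDS); the value 10 is only a placeholder. An error handler could consult isExpectedRejection first and delegate everything else to its fatal-error path.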