[ https://issues.apache.org/jira/browse/FLINK-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15585344#comment-15585344 ]
ASF GitHub Bot commented on FLINK-4715: --------------------------------------- Github user StephanEwen commented on a diff in the pull request: https://github.com/apache/flink/pull/2652#discussion_r83842005 --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/taskmanager/TaskTest.java --- @@ -565,6 +568,128 @@ public void testOnPartitionStateUpdate() throws Exception { verify(inputGate, times(1)).retriggerPartitionRequest(eq(partitionId.getPartitionId())); } + /** + * Tests that interrupt happens via watch dog if canceller is stuck in cancel. + * Task cancellation blocks the task canceller. Interrupt after cancel via + * cancellation watch dog. + */ + @Test + public void testWatchDogInterruptsTask() throws Exception { + Configuration config = new Configuration(); + config.setLong(TaskOptions.CANCELLATION_INTERVAL.key(), 5); + config.setLong(TaskOptions.CANCELLATION_TIMEOUT.key(), 50); + + Task task = createTask(InvokableBlockingInCancel.class, config); + task.startTaskThread(); + + awaitLatch.await(); + + task.cancelExecution(); + + triggerLatch.await(); + + // No fatal error + for (Object msg : taskManagerMessages) { + assertEquals(false, msg instanceof TaskManagerMessages.FatalError); + } + } + + /** + * The invoke() method holds a lock (trigger awaitLatch after acquisition) + * and cancel cannot complete because it also tries to acquire the same lock. + * This is resolved by the watch dog, no fatal error. + */ + @Test + public void testInterruptableSharedLockInInvokeAndCancel() throws Exception { + Configuration config = new Configuration(); + config.setLong(TaskOptions.CANCELLATION_INTERVAL.key(), 5); + config.setLong(TaskOptions.CANCELLATION_TIMEOUT.key(), 50); + + Task task = createTask(InvokableInterruptableSharedLockInInvokeAndCancel.class, config); + task.startTaskThread(); + + awaitLatch.await(); + + task.cancelExecution(); + + triggerLatch.await(); + + // No fatal error + for (Object msg : taskManagerMessages) { + assertEquals(false, msg instanceof TaskManagerMessages.FatalError); --- End diff -- You can also give a message to `assertFalse` - I like assertEquals for printing the expected value, but if the expected value is false, the former seems more natural to me... > TaskManager should commit suicide after cancellation failure > ------------------------------------------------------------ > > Key: FLINK-4715 > URL: https://issues.apache.org/jira/browse/FLINK-4715 > Project: Flink > Issue Type: Improvement > Components: TaskManager > Affects Versions: 1.2.0 > Reporter: Till Rohrmann > Assignee: Ufuk Celebi > Fix For: 1.2.0 > > > In case of a failed cancellation, e.g. the task cannot be cancelled after a > given time, the {{TaskManager}} should kill itself. That way we guarantee > that there is no resource leak. > This behaviour acts as a safety-net against faulty user code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)