[jira] [Commented] (FLINK-4715) TaskManager should commit suicide after cancellation failure

ASF GitHub Bot (JIRA) Tue, 18 Oct 2016 05:20:08 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15585320#comment-15585320
 ]


ASF GitHub Bot commented on FLINK-4715:
---------------------------------------

Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2652#discussion_r83840545
  
    --- Diff: 
flink-runtime/src/test/java/org/apache/flink/runtime/taskmanager/TaskTest.java 
---
    @@ -565,6 +568,128 @@ public void testOnPartitionStateUpdate() throws 
Exception {
                verify(inputGate, 
times(1)).retriggerPartitionRequest(eq(partitionId.getPartitionId()));
        }
     
    +   /**
    +    * Tests that interrupt happens via watch dog if canceller is stuck in 
cancel.
    +    * Task cancellation blocks the task canceller. Interrupt after cancel 
via
    +    * cancellation watch dog.
    +    */
    +   @Test
    +   public void testWatchDogInterruptsTask() throws Exception {
    +           Configuration config = new Configuration();
    +           config.setLong(TaskOptions.CANCELLATION_INTERVAL.key(), 5);
    +           config.setLong(TaskOptions.CANCELLATION_TIMEOUT.key(), 50);
    +
    +           Task task = createTask(InvokableBlockingInCancel.class, config);
    +           task.startTaskThread();
    +
    +           awaitLatch.await();
    +
    +           task.cancelExecution();
    +
    +           triggerLatch.await();
    +
    +           // No fatal error
    +           for (Object msg : taskManagerMessages) {
    +                   assertEquals(false, msg instanceof 
TaskManagerMessages.FatalError);
    +           }
    +   }
    +
    +   /**
    +    * The invoke() method holds a lock (trigger awaitLatch after 
acquisition)
    +    * and cancel cannot complete because it also tries to acquire the same 
lock.
    +    * This is resolved by the watch dog, no fatal error.
    +    */
    +   @Test
    +   public void testInterruptableSharedLockInInvokeAndCancel() throws 
Exception {
    +           Configuration config = new Configuration();
    +           config.setLong(TaskOptions.CANCELLATION_INTERVAL.key(), 5);
    +           config.setLong(TaskOptions.CANCELLATION_TIMEOUT.key(), 50);
    +
    +           Task task = 
createTask(InvokableInterruptableSharedLockInInvokeAndCancel.class, config);
    +           task.startTaskThread();
    +
    +           awaitLatch.await();
    +
    +           task.cancelExecution();
    +
    +           triggerLatch.await();
    --- End diff --
    
    how about adding `task.getExecutingThread().join()` instead of the using 
the trigger latch? Seems more intuitive and safer.


> TaskManager should commit suicide after cancellation failure
> ------------------------------------------------------------
>
>                 Key: FLINK-4715
>                 URL: https://issues.apache.org/jira/browse/FLINK-4715
>             Project: Flink
>          Issue Type: Improvement
>          Components: TaskManager
>    Affects Versions: 1.2.0
>            Reporter: Till Rohrmann
>            Assignee: Ufuk Celebi
>             Fix For: 1.2.0
>
>
> In case of a failed cancellation, e.g. the task cannot be cancelled after a 
> given time, the {{TaskManager}} should kill itself. That way we guarantee 
> that there is no resource leak. 
> This behaviour acts as a safety-net against faulty user code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-4715) TaskManager should commit suicide after cancellation failure

Reply via email to