[ 
https://issues.apache.org/jira/browse/FLINK-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372821#comment-16372821
 ] 

ASF GitHub Bot commented on FLINK-8732:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5548#discussion_r169960392
  
    --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/ExecutionGraphSchedulingTest.java ---
    @@ -412,6 +417,54 @@ public void testEagerSchedulingWithSlotTimeout() throws Exception {
                verify(taskManager, times(0)).submitTask(any(TaskDeploymentDescriptor.class), any(Time.class));
        }
     
    +   /**
    +    * Tests that an ongoing scheduling operation does not fail the {@link ExecutionGraph}
    +    * if it gets concurrently cancelled
    +    */
    +   @Test
    +   public void testSchedulingOperationCancellationWhenCancel() throws Exception {
    +           final JobVertex jobVertex = new JobVertex("NoOp JobVertex");
    +           jobVertex.setInvokableClass(NoOpInvokable.class);
    +           jobVertex.setParallelism(2);
    +           final JobGraph jobGraph = new JobGraph(jobVertex);
    +           jobGraph.setScheduleMode(ScheduleMode.EAGER);
    +           jobGraph.setAllowQueuedScheduling(true);
    +
    +           final CompletableFuture<LogicalSlot> slotFuture1 = new CompletableFuture<>();
    +           final CompletableFuture<LogicalSlot> slotFuture2 = new CompletableFuture<>();
    +           final ProgrammedSlotProvider slotProvider = new ProgrammedSlotProvider(2);
    +           slotProvider.addSlots(jobVertex.getID(), new CompletableFuture[]{slotFuture1, slotFuture2});
    +           final ExecutionGraph executionGraph = createExecutionGraph(jobGraph, slotProvider);
    +
    +           executionGraph.scheduleForExecution();
    +
    +           final CompletableFuture<?> releaseFuture = new CompletableFuture<>();
    +
    +           final TestingLogicalSlot slot = new TestingLogicalSlot(
    +                   new LocalTaskManagerLocation(),
    +                   new SimpleAckingTaskManagerGateway(),
    +                   0,
    +                   new AllocationID(),
    +                   new SlotRequestId(),
    +                   new SlotSharingGroupId(),
    +                   releaseFuture);
    +           slotFuture1.complete(slot);
    +
    +           // cancel should change the state of all executions to CANCELLED
    +           executionGraph.cancel();
    +
    +           // complete the now CANCELLED execution --> this should cause a failure
    +           slotFuture2.complete(new TestingLogicalSlot());
    +
    +           Thread.sleep(1L);
    --- End diff --
    
    Yes.


> Cancel scheduling operation when cancelling the ExecutionGraph
> --------------------------------------------------------------
>
>                 Key: FLINK-8732
>                 URL: https://issues.apache.org/jira/browse/FLINK-8732
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Major
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> With the Flip-6 changes and the support for queued scheduling, the 
> {{ExecutionGraph}} must be able to handle cancellation calls when it is not 
> yet fully scheduled. This is for example the case when waiting for new 
> containers.
> A cancellation will cancel all {{Executions}}. As a result, slots that become 
> available can get assigned to {{Executions}} that are already canceled. Since a 
> slot cannot be assigned to an {{Execution}} that is already canceled, this can 
> fail the overall eager scheduling operation. The scheduling result callback will 
> then trigger a global fail operation. This can happen before all 
> {{Executions}} have been released and, thus, when the {{ExecutionGraph}} is 
> still in the state {{CANCELLING}}. The result is that the {{ExecutionGraph}} 
> goes into the state {{FAILING}} and then {{FAILED}}.
> In order to solve this problem, I propose to keep track of the scheduling 
> operation and to cancel the result future when a concurrent {{suspend}}, 
> {{cancel}} or {{fail}} call happens.
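
The proposal in the issue description can be illustrated with a minimal, self-contained sketch. This is not the Flink implementation: MiniGraph, State, scheduleForExecution, cancel and failGlobal below are hypothetical stand-ins that only mirror the idea of remembering the scheduling future and cancelling it on a concurrent cancel call. It relies only on java.util.concurrent.CompletableFuture.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration only; none of these types exist in Flink.
class SchedulingCancellationSketch {

    enum State { CREATED, RUNNING, CANCELLING, FAILING }

    static final class MiniGraph {
        private State state = State.CREATED;

        // The tracked scheduling operation (assumption: one future for the whole operation).
        private CompletableFuture<Void> schedulingFuture;

        synchronized void scheduleForExecution(CompletableFuture<Void> schedulingResult) {
            state = State.RUNNING;
            schedulingFuture = schedulingResult;

            // Scheduling result callback: a real failure triggers a global fail,
            // whereas a cancellation of the scheduling operation is ignored.
            schedulingResult.whenComplete((ignored, failure) -> {
                if (failure != null && !(failure instanceof CancellationException)) {
                    failGlobal();
                }
            });
        }

        synchronized void cancel() {
            state = State.CANCELLING;
            if (schedulingFuture != null) {
                // Once cancelled, the future is completed, so a later
                // completeExceptionally(...) no longer reaches the callback.
                schedulingFuture.cancel(false);
            }
        }

        synchronized void failGlobal() {
            state = State.FAILING;
        }

        synchronized State getState() {
            return state;
        }
    }

    public static void main(String[] args) {
        MiniGraph graph = new MiniGraph();
        CompletableFuture<Void> scheduling = new CompletableFuture<>();
        graph.scheduleForExecution(scheduling);

        graph.cancel();

        // A slot assignment failing after the cancel call cannot fail the graph anymore.
        boolean accepted = scheduling.completeExceptionally(
                new IllegalStateException("slot assigned to canceled execution"));

        System.out.println(accepted + " " + graph.getState()); // prints "false CANCELLING"
    }
}
```

Without the schedulingFuture.cancel(false) call in cancel(), the late completeExceptionally would be accepted and the callback would move the sketch from CANCELLING to FAILING, which corresponds to the behaviour described in the issue.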



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
