[ 
https://issues.apache.org/jira/browse/FLINK-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhu Zhu updated FLINK-23806:
----------------------------
    Description: 
When requested slots are not fulfilled in time, task failure will be triggered 
and all related tasks will be canceled and restarted. However, in this process, 
if a task is already assigned a slot, the slot will be returned to the slot 
pool and it will be immediately used to fulfill pending slot requests of the 
tasks which will soon be canceled. The execution version of those tasks are 
already bumped in {{DefaultScheduler#restartTasksWithDelay(...)}} so that the 
assignment will fail immediately and the slot will be returned to the slot pool 
and again used to fulfill pending slot requests. StackOverflow can happen in 
this way when there are many vertices, and fatal error can happen and lead to 
JM crash. A sample call stack is attached below.

To fix the problem, one way is to cancel the pending requests of all the tasks 
which will be canceled soon(i.e. tasks with version bumped) before canceling 
these tasks.

{panel}
...
        at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotProviderImpl.cancelSlotRequest(PhysicalSlotProviderImpl.java:112)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.releaseSharedSlot(SlotSharingExecutionSlotAllocator.java:242)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.SharedSlot.releaseExternally(SharedSlot.java:281)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.SharedSlot.removeLogicalSlotRequest(SharedSlot.java:242)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.SharedSlot.returnLogicalSlot(SharedSlot.java:234)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.lambda$returnSlotToOwner$0(SingleLogicalSlot.java:203)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705) 
~[?:1.8.0_102]
        at 
java.util.concurrent.CompletableFuture.uniRunStage(CompletableFuture.java:717) 
~[?:1.8.0_102]
        at 
java.util.concurrent.CompletableFuture.thenRun(CompletableFuture.java:2010) 
~[?:1.8.0_102]
        at 
org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.returnSlotToOwner(SingleLogicalSlot.java:200)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.releaseSlot(SingleLogicalSlot.java:130)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.DefaultScheduler.releaseSlotIfPresent(DefaultScheduler.java:542)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$8(DefaultScheduler.java:505)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) 
~[?:1.8.0_102]
        at 
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
 ~[?:1.8.0_102]
        at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) 
~[?:1.8.0_102]
        at 
java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) 
~[?:1.8.0_102]
        at 
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge$PendingRequest.fulfill(DeclarativeSlotPoolBridge.java:552)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge$PendingRequestSlotMatching.fulfillPendingRequest(DeclarativeSlotPoolBridge.java:587)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge.newSlotsAreAvailable(DeclarativeSlotPoolBridge.java:171)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.lambda$freeReservedSlot$0(DefaultDeclarativeSlotPool.java:316)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_102]
        at 
org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.freeReservedSlot(DefaultDeclarativeSlotPool.java:313)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge.releaseSlot(DeclarativeSlotPoolBridge.java:335)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotProviderImpl.cancelSlotRequest(PhysicalSlotProviderImpl.java:112)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
...

{panel}


  was:
When requested slots are not fulfilled in time, task failure will be triggered 
and all related tasks will be canceled and restarted. However, in this process, 
if a task is already assigned a slot, the slot will be returned to the slot 
pool and it will be immediately used to fulfill pending slot requests of the 
tasks which will soon be canceled. The execution version of those tasks are 
already bumped in {{DefaultScheduler#restartTasksWithDelay(...)}} so that the 
assignment will fail immediately and the slot will be returned to the slot pool 
and again used to fulfill pending slot requests. StackOverflow can happen in 
this way when there are many vertices, and fatal error can happen and lead to 
JM will crash. A sample call stack is attached below.

To fix the problem, one way is to cancel the pending requests of all the tasks 
which will be canceled soon(i.e. tasks with version bumped) before canceling 
these tasks.

{panel}
...
        at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotProviderImpl.cancelSlotRequest(PhysicalSlotProviderImpl.java:112)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.releaseSharedSlot(SlotSharingExecutionSlotAllocator.java:242)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.SharedSlot.releaseExternally(SharedSlot.java:281)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.SharedSlot.removeLogicalSlotRequest(SharedSlot.java:242)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.SharedSlot.returnLogicalSlot(SharedSlot.java:234)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.lambda$returnSlotToOwner$0(SingleLogicalSlot.java:203)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705) 
~[?:1.8.0_102]
        at 
java.util.concurrent.CompletableFuture.uniRunStage(CompletableFuture.java:717) 
~[?:1.8.0_102]
        at 
java.util.concurrent.CompletableFuture.thenRun(CompletableFuture.java:2010) 
~[?:1.8.0_102]
        at 
org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.returnSlotToOwner(SingleLogicalSlot.java:200)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.releaseSlot(SingleLogicalSlot.java:130)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.DefaultScheduler.releaseSlotIfPresent(DefaultScheduler.java:542)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$8(DefaultScheduler.java:505)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) 
~[?:1.8.0_102]
        at 
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
 ~[?:1.8.0_102]
        at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) 
~[?:1.8.0_102]
        at 
java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) 
~[?:1.8.0_102]
        at 
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge$PendingRequest.fulfill(DeclarativeSlotPoolBridge.java:552)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge$PendingRequestSlotMatching.fulfillPendingRequest(DeclarativeSlotPoolBridge.java:587)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge.newSlotsAreAvailable(DeclarativeSlotPoolBridge.java:171)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.lambda$freeReservedSlot$0(DefaultDeclarativeSlotPool.java:316)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_102]
        at 
org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.freeReservedSlot(DefaultDeclarativeSlotPool.java:313)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge.releaseSlot(DeclarativeSlotPoolBridge.java:335)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
        at 
org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotProviderImpl.cancelSlotRequest(PhysicalSlotProviderImpl.java:112)
 ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
...

{panel}



> StackOverflowException can happen if a large scale job failed to acquire 
> enough slots in time
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-23806
>                 URL: https://issues.apache.org/jira/browse/FLINK-23806
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.5, 1.13.2
>            Reporter: Zhu Zhu
>            Assignee: Zhu Zhu
>            Priority: Critical
>             Fix For: 1.14.0, 1.12.6, 1.13.3
>
>
> When requested slots are not fulfilled in time, task failure will be 
> triggered and all related tasks will be canceled and restarted. However, in 
> this process, if a task is already assigned a slot, the slot will be returned 
> to the slot pool and it will be immediately used to fulfill pending slot 
> requests of the tasks which will soon be canceled. The execution version of 
> those tasks are already bumped in 
> {{DefaultScheduler#restartTasksWithDelay(...)}} so that the assignment will 
> fail immediately and the slot will be returned to the slot pool and again 
> used to fulfill pending slot requests. StackOverflow can happen in this way 
> when there are many vertices, and fatal error can happen and lead to JM 
> crash. A sample call stack is attached below.
> To fix the problem, one way is to cancel the pending requests of all the 
> tasks which will be canceled soon(i.e. tasks with version bumped) before 
> canceling these tasks.
> {panel}
> ...
>         at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotProviderImpl.cancelSlotRequest(PhysicalSlotProviderImpl.java:112)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at 
> org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.releaseSharedSlot(SlotSharingExecutionSlotAllocator.java:242)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at 
> org.apache.flink.runtime.scheduler.SharedSlot.releaseExternally(SharedSlot.java:281)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at 
> org.apache.flink.runtime.scheduler.SharedSlot.removeLogicalSlotRequest(SharedSlot.java:242)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at 
> org.apache.flink.runtime.scheduler.SharedSlot.returnLogicalSlot(SharedSlot.java:234)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at 
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.lambda$returnSlotToOwner$0(SingleLogicalSlot.java:203)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at 
> java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705) 
> ~[?:1.8.0_102]
>         at 
> java.util.concurrent.CompletableFuture.uniRunStage(CompletableFuture.java:717)
>  ~[?:1.8.0_102]
>         at 
> java.util.concurrent.CompletableFuture.thenRun(CompletableFuture.java:2010) 
> ~[?:1.8.0_102]
>         at 
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.returnSlotToOwner(SingleLogicalSlot.java:200)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at 
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.releaseSlot(SingleLogicalSlot.java:130)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.releaseSlotIfPresent(DefaultScheduler.java:542)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$8(DefaultScheduler.java:505)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) 
> ~[?:1.8.0_102]
>         at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
>  ~[?:1.8.0_102]
>         at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  ~[?:1.8.0_102]
>         at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) 
> ~[?:1.8.0_102]
>         at 
> org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge$PendingRequest.fulfill(DeclarativeSlotPoolBridge.java:552)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at 
> org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge$PendingRequestSlotMatching.fulfillPendingRequest(DeclarativeSlotPoolBridge.java:587)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at 
> org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge.newSlotsAreAvailable(DeclarativeSlotPoolBridge.java:171)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at 
> org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.lambda$freeReservedSlot$0(DefaultDeclarativeSlotPool.java:316)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_102]
>         at 
> org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.freeReservedSlot(DefaultDeclarativeSlotPool.java:313)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at 
> org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge.releaseSlot(DeclarativeSlotPoolBridge.java:335)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
>         at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotProviderImpl.cancelSlotRequest(PhysicalSlotProviderImpl.java:112)
>  ~[flink-dist_2.11-1.13-vvr-4.0.7-SNAPSHOT.jar:1.13-vvr-4.0.7-SNAPSHOT]
> ...
> {panel}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to