azagrebin opened a new pull request #13879: URL: https://github.com/apache/flink/pull/13879
If the physical slot future fails immediately for a new `SharedSlot` in `SlotSharingExecutionSlotAllocator#getOrAllocateSharedSlot` and we nevertheless continue to add logical slots to this `SharedSlot`, the first logical slot also fails and is removed from the `SharedSlot`, which is then released (state `RELEASED`). Subsequent attempts to add logical slots in the loop of `allocateLogicalSlotsFromSharedSlots` then fail the scheduling on the `ALLOCATED` state check, because the state is already `RELEASED`. The subsequent bulk timeout check also cannot find the `SharedSlot` and fails with an NPE.

Hence, a `SharedSlot` whose physical slot future has failed immediately should not be kept in the `SlotSharingExecutionSlotAllocator`, and the logical slot requests that depend on it can be returned as failed right away.

The bulk timeout check does not need to be started in this case: if some physical slot requests (and their logical slot requests) have failed, the whole bulk will be canceled by the scheduler. If this assumption stops holding for future scheduling, this bulk failure might need additional explicit cancellation of the pending requests. We expect to refactor this for declarative scheduling anyway.
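The fail-fast behaviour described above can be sketched in isolation. This is a minimal model, not the code in this PR: `SharedSlotSketch`, its nested `SharedSlot`, `allocateLogicalSlots`, `hasSharedSlot`, and the use of plain strings for slots and executions are all hypothetical stand-ins, assuming only that the physical slot request is exposed as a `CompletableFuture`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical, simplified model; not Flink's SlotSharingExecutionSlotAllocator. */
final class SharedSlotSketch {

    /** Stand-in for a shared slot: it only holds the physical slot future. */
    static final class SharedSlot {
        final CompletableFuture<String> physicalSlotFuture;

        SharedSlot(CompletableFuture<String> physicalSlotFuture) {
            this.physicalSlotFuture = physicalSlotFuture;
        }
    }

    private final Map<String, SharedSlot> sharedSlots = new HashMap<>();

    /** Returns one logical slot future per execution of the given sharing group. */
    List<CompletableFuture<String>> allocateLogicalSlots(
            String groupId,
            List<String> executions,
            Supplier<CompletableFuture<String>> physicalSlotRequest) {
        SharedSlot existing = sharedSlots.get(groupId);
        CompletableFuture<String> physicalSlotFuture;
        if (existing != null) {
            physicalSlotFuture = existing.physicalSlotFuture;
        } else {
            physicalSlotFuture = physicalSlotRequest.get();
            // Register the shared slot only if the request has not already failed.
            if (!physicalSlotFuture.isCompletedExceptionally()) {
                sharedSlots.put(groupId, new SharedSlot(physicalSlotFuture));
            }
        }
        List<CompletableFuture<String>> logicalSlots = new ArrayList<>();
        for (String execution : executions) {
            // A future derived from an already failed future is itself already failed.
            logicalSlots.add(physicalSlotFuture.thenApply(slot -> slot + "/" + execution));
        }
        return logicalSlots;
    }

    boolean hasSharedSlot(String groupId) {
        return sharedSlots.containsKey(groupId);
    }

    public static void main(String[] args) {
        SharedSlotSketch allocator = new SharedSlotSketch();
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IllegalStateException("no resources"));

        List<CompletableFuture<String>> slots =
                allocator.allocateLogicalSlots("group-1", Arrays.asList("e1", "e2"), () -> failed);

        for (CompletableFuture<String> slot : slots) {
            System.out.println("logical slot failed: " + slot.isCompletedExceptionally());
        }
        System.out.println("shared slot kept: " + allocator.hasSharedSlot("group-1"));
    }
}
```

Because the failed shared slot is never registered, no later lookup in this model (such as a timeout check) can encounter a released or missing entry for that group; every dependent logical slot request is simply handed back as an already failed future.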
