azagrebin opened a new pull request #13879:
URL: https://github.com/apache/flink/pull/13879


   If the physical slot future of a new `SharedSlot` fails immediately in 
`SlotSharingExecutionSlotAllocator#getOrAllocateSharedSlot` but we continue to 
add logical slots to this `SharedSlot`, the logical slot eventually fails as 
well and is removed from the `SharedSlot`, which is then released (state 
`RELEASED`). Subsequent attempts to add logical slots in the loop of 
`allocateLogicalSlotsFromSharedSlots` then fail the scheduling on the 
`ALLOCATED` state check, because the state is already `RELEASED`.
   
   The subsequent bulk timeout check will also not find the `SharedSlot` and 
will fail with an NPE.
   
   Hence, a `SharedSlot` whose physical slot future failed immediately should 
not be kept in the `SlotSharingExecutionSlotAllocator`, and the logical slot 
requests that depend on it can be returned as failed right away. The bulk 
timeout check does not need to be started, because if any physical slot 
request (and its logical slot requests) fails, the scheduler cancels the whole 
bulk.
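
   A minimal sketch of the idea, not the actual Flink code: the `Allocator` 
class, its `String`-typed slots and the `isTracked` helper are hypothetical 
stand-ins for the real `SlotSharingExecutionSlotAllocator` bookkeeping. It only 
illustrates that a shared slot whose physical slot future has already failed is 
never registered, and that the dependent logical slot request is returned as 
failed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

public class SharedSlotSketch {

    /** Hypothetical, simplified allocator; slots are modelled as plain strings. */
    public static final class Allocator {
        // Shared slots that are currently tracked, keyed by slot sharing group.
        private final Map<String, CompletableFuture<String>> sharedSlots = new HashMap<>();

        /**
         * Returns a logical slot future for the given group. If the group has no
         * tracked shared slot yet, the given physical slot future is used for a new
         * one, unless that future has already failed.
         */
        public CompletableFuture<String> allocateLogicalSlot(
                String group, CompletableFuture<String> physicalSlotFuture) {
            CompletableFuture<String> tracked = sharedSlots.get(group);
            if (tracked != null) {
                return tracked.thenApply(physical -> physical + "/logical");
            }
            // Derived future: fails automatically if the physical slot future failed.
            CompletableFuture<String> logical =
                    physicalSlotFuture.thenApply(physical -> physical + "/logical");
            if (physicalSlotFuture.isCompletedExceptionally()) {
                // Do not keep the shared slot; hand back the failed logical request.
                return logical;
            }
            sharedSlots.put(group, physicalSlotFuture);
            return logical;
        }

        /** Whether a shared slot is tracked for the group. */
        public boolean isTracked(String group) {
            return sharedSlots.containsKey(group);
        }
    }

    public static void main(String[] args) {
        Allocator allocator = new Allocator();

        CompletableFuture<String> failedPhysical = new CompletableFuture<>();
        failedPhysical.completeExceptionally(new IllegalStateException("no resources"));

        CompletableFuture<String> logical = allocator.allocateLogicalSlot("group-1", failedPhysical);
        System.out.println("logical request failed: " + logical.isCompletedExceptionally());
        System.out.println("shared slot tracked: " + allocator.isTracked("group-1"));
    }
}
```

   Because nothing is registered for the failed case, no later loop iteration 
or timeout check in this sketch can observe a released or missing shared slot.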
   
   If this last assumption does not hold for future scheduling, this bulk 
failure might need additional explicit cancellation of pending requests. We 
expect to refactor this for declarative scheduling anyway.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

