Andrey Zagrebin created FLINK-19142:
---------------------------------------
Summary: Investigate slot hijacking from preceding pipelined
regions after failover
Key: FLINK-19142
URL: https://issues.apache.org/jira/browse/FLINK-19142
Project: Flink
Issue Type: Improvement
Reporter: Andrey Zagrebin
The ticket originates from [this PR
discussion|https://github.com/apache/flink/pull/13181#discussion_r481087221].
PreviousAllocationSlotSelectionStrategy uses the previous AllocationIDs to
schedule each subtask into the slot where it ran before a failover. If that
slot (AllocationID) is not available, the subtask should not take the previous
slot (AllocationID) of another subtask.
The MergingSharedSlotProfileRetriever gets previous AllocationIDs from
SlotSharingExecutionSlotAllocator, but only those of the current bulk. The
previous AllocationIDs of other bulks remain unknown to it. Therefore, the
current bulk can potentially hijack the previous slots of preceding bulks. On
the other hand, the previous AllocationIDs of other tasks should be taken if
those tasks are not going to run at the same time, e.g. because there are not
enough resources after failover or the other bulks are done.
One way to do this may be to give MergingSharedSlotProfileRetriever the
previous AllocationIDs of all bulks that are going to run at the same time.
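
The selection rule described in this ticket could be sketched roughly as
follows. This is only an illustration under stated assumptions, not Flink code:
the class SlotHijackSketch, the methods reservedAllocations and selectSlot, and
the use of plain strings in place of AllocationID are all hypothetical. It
assumes the caller already knows which bulks are going to run at the same time.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch; none of these names exist in Flink.
final class SlotHijackSketch {

    // Union of the previous allocation IDs of all bulks assumed to run concurrently.
    static Set<String> reservedAllocations(List<Set<String>> concurrentBulks) {
        Set<String> reserved = new LinkedHashSet<>();
        for (Set<String> bulk : concurrentBulks) {
            reserved.addAll(bulk);
        }
        return reserved;
    }

    // Picks a slot for one subtask:
    // 1. its own previous allocation, if that slot is still available;
    // 2. otherwise the first available slot that is not reserved, i.e. not the
    //    previous slot of another subtask that will run at the same time;
    // 3. otherwise nothing (the caller would have to request a new slot).
    static Optional<String> selectSlot(
            String previousAllocation, Set<String> available, Set<String> reserved) {
        if (previousAllocation != null && available.contains(previousAllocation)) {
            return Optional.of(previousAllocation);
        }
        for (String candidate : available) {
            if (!reserved.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

Once a bulk is done, or cannot run at the same time, its previous allocation
IDs would simply be left out of the reserved set, so its slots become eligible
again for other subtasks.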