[ https://issues.apache.org/jira/browse/FLINK-19142 ]
Zhu Zhu deleted comment on FLINK-19142:
---------------------------------
was (Author: flink-jira-bot):
I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help
the community manage its development. I see this issue is assigned but has not
received an update in 30 days, so it has been labeled "stale-assigned".
If you are still working on the issue, please remove the label and add a
comment updating the community on your progress. If this issue is waiting on
feedback, please consider this a reminder to the committer/reviewer. Flink is a
very active project, and so we appreciate your patience.
If you are no longer working on the issue, please unassign yourself so someone
else may work on it.
> Local recovery can be broken if slot hijacking happened during a full restart
> -----------------------------------------------------------------------------
>
> Key: FLINK-19142
> URL: https://issues.apache.org/jira/browse/FLINK-19142
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.12.0
> Reporter: Andrey Zagrebin
> Assignee: Zhu Zhu
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.15.0, 1.14.1
>
>
> The ticket originates from [this PR
> discussion|https://github.com/apache/flink/pull/13181#discussion_r481087221].
> The previous AllocationIDs are used by
> PreviousAllocationSlotSelectionStrategy to schedule subtasks into the slots
> in which they ran before a failover. If a subtask's previous slot
> (AllocationID) is not available, we do not want it to take the previous
> slots (AllocationIDs) of other subtasks.
> The MergingSharedSlotProfileRetriever gets the previous AllocationIDs from
> SlotSharingExecutionSlotAllocator, but only those of the current bulk.
> The previous AllocationIDs of other bulks stay unknown. Therefore, the
> current bulk can potentially hijack the previous slots of the preceding
> bulks. On the other hand, the previous AllocationIDs of other tasks should be
> taken if those tasks are not going to run at the same time, e.g. when there
> are not enough resources after failover or the other bulks are done.
> Local recovery can be broken by this, e.g. when multiple regions of a
> streaming job are restarted at the same time (due to a global failover, or a
> task failover with the `full` failover strategy).
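
The hijacking scenario described above can be illustrated with a small, self-contained sketch. This is a toy model, not Flink's implementation: the class `SlotHijackSketch`, its methods `select` and `schedule`, the `seeAllBulks` flag, and the use of plain strings as AllocationIDs are all assumptions made for illustration. It also does not model the fallback in which a task may take another task's previous slot because that task will not run at the same time.

```java
import java.util.*;

/** Toy model of previous-allocation-aware slot selection; hypothetical, not Flink code. */
final class SlotHijackSketch {

    /**
     * Picks a slot for one subtask.
     * previous: the AllocationID the subtask ran in before the failover
     * free:     AllocationIDs of the currently free slots, in offer order
     * reserved: previous AllocationIDs known to belong to other subtasks
     */
    static Optional<String> select(String previous, List<String> free, Set<String> reserved) {
        // Prefer the slot the subtask ran in before the failover.
        if (free.contains(previous)) {
            return Optional.of(previous);
        }
        // Otherwise take the first free slot that is not known to be
        // the previous slot of another subtask.
        for (String slot : free) {
            if (!reserved.contains(slot)) {
                return Optional.of(slot);
            }
        }
        return Optional.empty();
    }

    /**
     * Schedules the bulks one after another and returns subtask -> assigned slot.
     * Each bulk maps a subtask name to its previous AllocationID.
     * If seeAllBulks is false, a bulk only knows its own previous AllocationIDs.
     */
    static Map<String, String> schedule(
            List<Map<String, String>> bulks, List<String> slots, boolean seeAllBulks) {
        List<String> free = new ArrayList<>(slots);
        Map<String, String> assignment = new LinkedHashMap<>();

        // Previous AllocationIDs of every bulk, used only when seeAllBulks is true.
        Set<String> allPrevious = new HashSet<>();
        for (Map<String, String> bulk : bulks) {
            allPrevious.addAll(bulk.values());
        }

        for (Map<String, String> bulk : bulks) {
            Set<String> known = seeAllBulks ? allPrevious : new HashSet<>(bulk.values());
            for (Map.Entry<String, String> entry : bulk.entrySet()) {
                // A subtask's own previous slot is never reserved against itself.
                Set<String> reserved = new HashSet<>(known);
                reserved.remove(entry.getValue());

                Optional<String> slot = select(entry.getValue(), free, reserved);
                if (slot.isPresent()) {
                    free.remove(slot.get());
                    assignment.put(entry.getKey(), slot.get());
                }
            }
        }
        return assignment;
    }
}
```

As a usage example, take two free slots `A1` and `A2`, a first bulk containing subtask `s1` whose previous slot `lost` is gone, and a second bulk containing `s2` whose previous slot is `A1`. With `seeAllBulks` set to `false`, `s1` takes `A1` and `s2` is pushed to `A2`, so `s2` loses its local state. With `seeAllBulks` set to `true`, `s1` takes `A2` and `s2` gets `A1` back.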