[jira] [Comment Edited] (FLINK-19142) Local recovery can be broken if slot hijacking happened during a full restart

Zhu Zhu (Jira) Fri, 05 Nov 2021 02:26:33 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-19142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438675#comment-17438675
 ]


Zhu Zhu edited comment on FLINK-19142 at 11/5/21, 9:25 AM:
-----------------------------------------------------------

Fixed in master (1.15):
4618b6a6e0584fc9054505035bb6a3b4e951c937
4d9de6c5b826ab482f47f33c8da957ac9b550c32


was (Author: zhuzh):
Done via
4618b6a6e0584fc9054505035bb6a3b4e951c937
4d9de6c5b826ab482f47f33c8da957ac9b550c32

> Local recovery can be broken if slot hijacking happened during a full restart
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-19142
>                 URL: https://issues.apache.org/jira/browse/FLINK-19142
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Andrey Zagrebin
>            Assignee: Zhu Zhu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0, 1.14.1
>
>
> The ticket originates from [this PR 
> discussion|https://github.com/apache/flink/pull/13181#discussion_r481087221].
> PreviousAllocationSlotSelectionStrategy uses the previous AllocationIDs to 
> schedule subtasks into the slots where they were executed before a failover. 
> If a subtask's previous slot (AllocationID) is not available, we do not want 
> it to take the previous slots (AllocationIDs) of other subtasks.
> The MergingSharedSlotProfileRetriever gets the previous AllocationIDs from 
> SlotSharingExecutionSlotAllocator, but only those of the current bulk. 
> The previous AllocationIDs of other bulks stay unknown. Therefore, the 
> current bulk can potentially hijack the previous slots of the preceding 
> bulks. On the other hand, the previous AllocationIDs of other tasks should 
> be taken if those tasks are not going to run at the same time, e.g. when 
> there are not enough resources after failover or the other bulks are done.
> Local recovery can be broken by this, e.g. when multiple regions of a 
> streaming job are restarted at the same time (due to a global failover, or 
> a task failover with the `full` failover strategy).
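
To make the hijacking scenario described above concrete, here is a minimal, self-contained sketch. It is not Flink's code and not the fix from the commits listed above: the class SlotSelectionSketch, the method selectSlot, and the use of plain strings as allocation IDs are all hypothetical names chosen for illustration. The sketch assumes the selection first prefers the subtask's own previous slot and otherwise falls back to the first free slot that is not reserved as the previous slot of another subtask expected to run at the same time.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Flink.
class SlotSelectionSketch {

    /**
     * Picks a slot for one subtask.
     *
     * @param availableSlots allocation IDs of the currently free slots
     * @param previousAllocationId slot the subtask ran in before the failover (may be null)
     * @param reservedPreviousAllocationIds previous slots of other subtasks that are
     *     expected to run at the same time (assumption: the caller knows them across bulks)
     */
    static Optional<String> selectSlot(
            List<String> availableSlots,
            String previousAllocationId,
            Set<String> reservedPreviousAllocationIds) {
        // 1. Prefer the subtask's own previous slot, so local state can be reused.
        if (previousAllocationId != null && availableSlots.contains(previousAllocationId)) {
            return Optional.of(previousAllocationId);
        }
        // 2. Otherwise take the first free slot that no other subtask is waiting for.
        return availableSlots.stream()
                .filter(id -> !reservedPreviousAllocationIds.contains(id))
                .findFirst();
    }

    public static void main(String[] args) {
        List<String> available = List.of("a1", "a2", "a3");

        // The subtask's previous slot "a9" is gone. If the previous slots of other
        // bulks are unknown (empty reserved set), it takes "a1", which another
        // subtask may have wanted: this is the hijacking case.
        System.out.println(selectSlot(available, "a9", Set.of()));

        // With "a1" and "a2" known as previous slots of other subtasks,
        // the subtask falls back to "a3" and leaves their slots alone.
        System.out.println(selectSlot(available, "a9", Set.of("a1", "a2")));
    }
}
```

In line with the description, the reserved set in this sketch should only contain previous slots of subtasks that will actually run at the same time. Slots of tasks that are finished, or that cannot be scheduled for lack of resources, would be left out so that they can still be taken.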



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
