[ 
https://issues.apache.org/jira/browse/FLINK-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971359#comment-16971359
 ] 

Zhu Zhu commented on FLINK-14701:
---------------------------------

[~chesnay], what do you think of the issue and the proposed solution?
This issue also happens in 1.9 so I think we also need the fix there.

> Slot leaks if SharedSlotOversubscribedException happens
> -------------------------------------------------------
>
>                 Key: FLINK-14701
>                 URL: https://issues.apache.org/jira/browse/FLINK-14701
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0, 1.9.2
>            Reporter: Zhu Zhu
>            Priority: Blocker
>             Fix For: 1.10.0, 1.9.2
>
>
> If a {{SharedSlotOversubscribedException}} happens, the {{MultiTaskSlot}}
> releases some of its child {{SingleTaskSlot}}s. Releasing a child triggers
> a re-allocation of the task slot right inside
> {{SingleTaskSlot#release(...)}}. As a result, the previous allocation
> in {{SlotSharingManager#allTaskSlots}} is replaced by the new allocation,
> because both share the same {{slotRequestId}}.
> However, {{SingleTaskSlot#release(...)}} then invokes
> {{MultiTaskSlot#releaseChild}} to release the previous allocation by its
> {{slotRequestId}}, which unexpectedly removes the new allocation from the
> {{SlotSharingManager}}.
> The slot then leaks, because the pending slot request is no longer
> tracked by the {{SlotSharingManager}} and cannot be released when its payload
> terminates.
> A test case {{testNoSlotLeakOnSharedSlotOversubscribedException}} which 
> exhibits this issue can be found in this 
> [commit|https://github.com/zhuzhurk/flink/commit/9024e2e9eb4bd17f371896d6dbc745bc9e585e14].
> The slot leak blocks the TPC-DS queries on Flink 1.10, see FLINK-14674.
> To solve it, I propose hardening {{MultiTaskSlot#releaseChild}} so that it
> only removes its true child task slot from the {{SlotSharingManager}}, i.e. add
> a check {{if (child == allTaskSlots.get(child.getSlotRequestId()))}} before
> invoking {{allTaskSlots.remove(child.getSlotRequestId())}}.
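
The sequence described above can be reproduced with a small standalone model. This is only a sketch and not the Flink code: {{TaskSlot}}, {{newAllocationTracked}} and the {{guarded}} flag are hypothetical names introduced for illustration, and a plain {{HashMap}} stands in for {{allTaskSlots}}. It shows why removing by {{slotRequestId}} alone drops the new allocation, and how the proposed identity check avoids that.

```java
import java.util.HashMap;
import java.util.Map;

public class SlotLeakSketch {

    /** Hypothetical stand-in for a SingleTaskSlot; only the request id matters here. */
    static final class TaskSlot {
        final String slotRequestId;

        TaskSlot(String slotRequestId) {
            this.slotRequestId = slotRequestId;
        }
    }

    /**
     * Models releaseChild. Without the guard, the entry is removed by id only.
     * With the guard, it is removed only if the map still holds this exact child.
     */
    static void releaseChild(Map<String, TaskSlot> allTaskSlots, TaskSlot child, boolean guarded) {
        if (!guarded || child == allTaskSlots.get(child.slotRequestId)) {
            allTaskSlots.remove(child.slotRequestId);
        }
    }

    /** Replays the sequence from the description and reports whether the new allocation is still tracked. */
    static boolean newAllocationTracked(boolean guarded) {
        Map<String, TaskSlot> allTaskSlots = new HashMap<>();

        // The original allocation is registered under its slot request id.
        TaskSlot previous = new TaskSlot("request-1");
        allTaskSlots.put(previous.slotRequestId, previous);

        // Releasing the previous slot triggers a re-allocation with the same id,
        // which replaces the map entry.
        TaskSlot reallocated = new TaskSlot("request-1");
        allTaskSlots.put(reallocated.slotRequestId, reallocated);

        // The release path then asks the parent to release the previous child.
        releaseChild(allTaskSlots, previous, guarded);

        return allTaskSlots.get("request-1") == reallocated;
    }

    public static void main(String[] args) {
        System.out.println("tracked without check: " + newAllocationTracked(false));
        System.out.println("tracked with check:    " + newAllocationTracked(true));
    }
}
```

Without the check the new allocation is no longer in the map, which corresponds to the leak. With the check the stale release becomes a no-op and the new allocation stays tracked.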



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
