[ 
https://issues.apache.org/jira/browse/FLINK-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878475#comment-16878475
 ] 

Andrey Zagrebin commented on FLINK-12736:
-----------------------------------------

Thanks for the corollary, [~till.rohrmann]. I think it is a valid concern.

As an alternative to counters, we could use a simpler approach: record the 
_taskManagerRegistration.getIdleSince()_ time before starting the async 'no 
partition' check. The TM can be released only if the idle-since time read after 
the check equals the recorded one. Otherwise we discard the release and start 
over after the next timeout. This way we could also guarantee that no resource 
was allocated in between.
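
A minimal sketch of this idea, not the actual ResourceManager code: the 
Registration class, allocateSlot(), freeSlot() and releaseIfStillIdle() are 
hypothetical names for illustration, and only getIdleSince() is taken from the 
proposal. It also assumes that any slot allocation changes the idle-since value.

```java
import java.util.concurrent.CompletableFuture;

final class IdleReleaseSketch {

    /** Hypothetical stand-in for the TM registration; only getIdleSince() comes from the proposal. */
    static final class Registration {
        private volatile long idleSince;

        Registration(long idleSince) {
            this.idleSince = idleSince;
        }

        long getIdleSince() {
            return idleSince;
        }

        // Assumption: an allocation moves the idle-since value away from any earlier reading.
        void allocateSlot() {
            idleSince = Long.MAX_VALUE;
        }

        // Assumption: once all slots are free again, idle-since is reset to the current time.
        void freeSlot(long now) {
            idleSince = now;
        }
    }

    /**
     * Records idle-since before the async 'no partition' check and permits the
     * release only if the check reports no partitions AND idle-since is unchanged.
     */
    static CompletableFuture<Boolean> releaseIfStillIdle(
            Registration registration, CompletableFuture<Boolean> noPartitionCheck) {
        final long markedIdleSince = registration.getIdleSince();
        return noPartitionCheck.thenApply(
                noPartitions -> noPartitions && registration.getIdleSince() == markedIdleSince);
    }
}
```

If the returned future completes with false, the release is discarded and the TM 
is simply reconsidered after the next idle timeout.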

> ResourceManager may release TM with allocated slots
> ---------------------------------------------------
>
>                 Key: FLINK-12736
>                 URL: https://issues.apache.org/jira/browse/FLINK-12736
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Chesnay Schepler
>            Priority: Critical
>             Fix For: 1.9.0
>
>
> The {{ResourceManager}} looks out for TaskManagers that have not had any 
> slots allocated on them for a while, as these could be released to save 
> resources. If such a TM is found, the RM checks via an RPC call whether the TM 
> still holds any partitions. If no partition is held, the TM is released.
> However, in the RPC callback no check is made whether the TM is actually 
> _still_ idle. In the meantime a slot could've been allocated on the TM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
