[ https://issues.apache.org/jira/browse/FLINK-9351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann updated FLINK-9351:
---------------------------------
Fix Version/s: 1.6.0
> RM stops assigning slots to the Job because the TM was killed before
> connecting to the JM successfully
> ----------------------------------------------------------------------------------------
>
> Key: FLINK-9351
> URL: https://issues.apache.org/jira/browse/FLINK-9351
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Affects Versions: 1.5.0
> Reporter: Sihua Zhou
> Priority: Critical
> Fix For: 1.6.0
>
>
> The steps are the following (copied from Stephan's comments in
> [5931|https://github.com/apache/flink/pull/5931]):
> - JobMaster / SlotPool requests a slot (AllocationID) from the ResourceManager
> - ResourceManager starts a container with a TaskManager
> - TaskManager registers at the ResourceManager, which tells the TaskManager to
> push a slot to the JobManager.
> - TaskManager container is killed
> - The ResourceManager does not re-queue the slot requests (AllocationIDs)
> that it sent to the previous TaskManager, so the requests are lost and must
> time out before another attempt is made.
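The four steps above can be replayed against a small, hypothetical model of the ResourceManager's bookkeeping. Every name here (`SimplifiedResourceManager`, `requestSlot`, `registerTaskManager`, `slotOfferAccepted`, `taskManagerLost`, `Flink9351Demo`) is invented for illustration and is not Flink's actual class or API. The `requeueOnTaskManagerLoss` flag is likewise an assumption: it contrasts the reported behavior (unconfirmed AllocationIDs are dropped) with one possible remedy (put them back into the pending queue when the TaskManager is lost), which may differ from how the issue is actually fixed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Flink9351Demo {
    // Replays the steps from the issue against one (hypothetical) RM variant.
    static int pendingAfterTaskManagerLoss(boolean requeue) {
        SimplifiedResourceManager rm = new SimplifiedResourceManager(requeue);
        rm.requestSlot("alloc-1");      // JobMaster / SlotPool requests a slot
        rm.registerTaskManager("tm-1"); // TM registers and is told to offer alloc-1
        rm.taskManagerLost("tm-1");     // TM container is killed before the offer lands
        return rm.pendingCount();
    }

    public static void main(String[] args) {
        System.out.println("without re-queue: " + pendingAfterTaskManagerLoss(false)); // 0
        System.out.println("with re-queue:    " + pendingAfterTaskManagerLoss(true));  // 1
    }
}

// Hypothetical, simplified ResourceManager; illustrative only, not Flink code.
class SimplifiedResourceManager {
    // Slot requests (AllocationIDs) not yet handed to any TaskManager.
    private final Deque<String> pendingRequests = new ArrayDeque<>();
    // AllocationIDs each TaskManager was told to offer, not yet confirmed by the JobMaster.
    private final Map<String, Set<String>> inFlightByTaskManager = new HashMap<>();
    // Assumed remedy: re-queue unconfirmed requests when their TaskManager is lost.
    private final boolean requeueOnTaskManagerLoss;

    SimplifiedResourceManager(boolean requeueOnTaskManagerLoss) {
        this.requeueOnTaskManagerLoss = requeueOnTaskManagerLoss;
    }

    void requestSlot(String allocationId) {
        pendingRequests.addLast(allocationId);
    }

    // Assigns the oldest pending request (if any) to the registering TaskManager.
    void registerTaskManager(String taskManagerId) {
        Set<String> inFlight =
                inFlightByTaskManager.computeIfAbsent(taskManagerId, id -> new HashSet<>());
        String allocationId = pendingRequests.pollFirst();
        if (allocationId != null) {
            inFlight.add(allocationId);
        }
    }

    // The JobMaster accepted the slot offer, so the request is fulfilled.
    void slotOfferAccepted(String taskManagerId, String allocationId) {
        Set<String> inFlight = inFlightByTaskManager.get(taskManagerId);
        if (inFlight != null) {
            inFlight.remove(allocationId);
        }
    }

    void taskManagerLost(String taskManagerId) {
        Set<String> inFlight = inFlightByTaskManager.remove(taskManagerId);
        if (inFlight == null || !requeueOnTaskManagerLoss) {
            // Reported behavior: unconfirmed AllocationIDs are simply dropped, so
            // the JobMaster only retries after its slot request times out.
            return;
        }
        // Remedy: put them back at the front so the next TaskManager serves them first.
        for (String allocationId : inFlight) {
            pendingRequests.addFirst(allocationId);
        }
    }

    int pendingCount() {
        return pendingRequests.size();
    }

    boolean isPending(String allocationId) {
        return pendingRequests.contains(allocationId);
    }
}
```

Re-queuing at the front rather than the back is a deliberate choice in this sketch: the lost requests are the oldest outstanding ones, so serving them first keeps the JobMaster from waiting behind newer requests. Requests whose offers were already accepted are not re-queued, since they no longer appear in the in-flight set.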