[jira] [Updated] (FLINK-9351) RM stop assigning slot to Job because the TM killed before connecting to JM successfully

Sihua Zhou (JIRA) Sun, 13 May 2018 22:42:12 -0700

     [ 
https://issues.apache.org/jira/browse/FLINK-9351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sihua Zhou updated FLINK-9351:
------------------------------
    Description: 
The steps are as follows (copied from Stephan's comments in [5931 
title|https://github.com/apache/flink/pull/5931]):

- JobMaster / SlotPool requests a slot (AllocationID) from the ResourceManager
- ResourceManager starts a container with a TaskManager
- TaskManager registers at ResourceManager, which tells the TaskManager to push 
a slot to the JobManager.
- TaskManager container is killed
- The ResourceManager does not queue back the slot requests (AllocationIDs) 
that it sent to the previous TaskManager, so the requests are lost and need to 
time out before another attempt is tried.

  was:
The steps are as follows (copied from Stephan's comments in [5931 
title|https://github.com/apache/flink/pull/5931]):

JobMaster / SlotPool requests a slot (AllocationID) from the ResourceManager
ResourceManager starts a container with a TaskManager
TaskManager registers at ResourceManager, which tells the TaskManager to push a 
slot to the JobManager.
TaskManager container is killed
The ResourceManager does not queue back the slot requests (AllocationIDs) that 
it sent to the previous TaskManager, so the requests are lost and need to time 
out before another attempt is tried.


> RM stop assigning slot to Job because the TM killed before connecting to JM 
> successfully
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-9351
>                 URL: https://issues.apache.org/jira/browse/FLINK-9351
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Sihua Zhou
>            Priority: Critical
>
> The steps are as follows (copied from Stephan's comments in [5931 
> title|https://github.com/apache/flink/pull/5931]):
> - JobMaster / SlotPool requests a slot (AllocationID) from the ResourceManager
> - ResourceManager starts a container with a TaskManager
> - TaskManager registers at ResourceManager, which tells the TaskManager to 
> push a slot to the JobManager.
> - TaskManager container is killed
> - The ResourceManager does not queue back the slot requests (AllocationIDs) 
> that it sent to the previous TaskManager, so the requests are lost and need 
> to time out before another attempt is tried.
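
The sequence above can be reproduced in a toy model. The sketch below is only an illustration of the bookkeeping problem, not Flink's actual ResourceManager or SlotManager code: the class ToySlotManager, its methods, and the "requeue" flag are all hypothetical names introduced here. It assumes that the ResourceManager keeps a queue of pending AllocationIDs and a per-TaskManager list of requests that were handed out but not yet confirmed by the JobMaster. With requeue=false the in-flight requests vanish when the TaskManager is lost (the behaviour described in this issue); with requeue=true they go back to the front of the queue, which is one possible way to avoid waiting for the timeout.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of slot-request bookkeeping; hypothetical, not Flink's real classes. */
class ToySlotManager {
    // AllocationIDs still waiting for a TaskManager slot.
    private final Deque<String> pending = new ArrayDeque<>();
    // AllocationIDs sent to a TaskManager but not yet confirmed by the JobMaster.
    private final Map<String, List<String>> inFlightByTaskManager = new HashMap<>();

    /** JobMaster / SlotPool requests a slot. */
    void requestSlot(String allocationId) {
        pending.addLast(allocationId);
    }

    /** A TaskManager registers; hand it as many pending requests as it has slots. */
    void registerTaskManager(String taskManagerId, int slots) {
        List<String> assigned = new ArrayList<>();
        while (assigned.size() < slots && !pending.isEmpty()) {
            assigned.add(pending.removeFirst());
        }
        inFlightByTaskManager.put(taskManagerId, assigned);
    }

    /** The TaskManager reached the JobMaster; the request is fulfilled. */
    void confirmAllocation(String taskManagerId, String allocationId) {
        List<String> inFlight = inFlightByTaskManager.get(taskManagerId);
        if (inFlight != null) {
            inFlight.remove(allocationId);
        }
    }

    /** The TaskManager container was killed. */
    void taskManagerLost(String taskManagerId, boolean requeue) {
        List<String> lost = inFlightByTaskManager.remove(taskManagerId);
        if (requeue && lost != null) {
            // Put unconfirmed requests back at the front, preserving their order.
            for (int i = lost.size() - 1; i >= 0; i--) {
                pending.addFirst(lost.get(i));
            }
        }
        // Without requeue the unconfirmed AllocationIDs are simply forgotten,
        // so the requester only notices once its own timeout fires.
    }

    List<String> pendingRequests() {
        return new ArrayList<>(pending);
    }
}
```

In this model, requesting "alloc-1", registering "tm-1" with one slot, and then calling taskManagerLost("tm-1", false) leaves no pending request, which mirrors the lost request in the report. The same sequence with requeue=true leaves "alloc-1" pending for the next TaskManager.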



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
