[ 
https://issues.apache.org/jira/browse/FLINK-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangjing updated FLINK-4535:
-----------------------------
    Description: 
When a TaskExecutor registers at the ResourceManager, it passes the following
three input parameters:
1. resourceManagerLeaderId: the fencing token for the ResourceManager leader,
as known to the TaskExecutor that sends the registration
2. taskExecutorAddress: the address of the TaskExecutor
3. resourceID: the resource ID of the registering TaskExecutor

The ResourceManager needs to process the registration in the following steps:
1. Check whether the given resourceManagerLeaderId matches the current
leadershipSessionId. If it does not, two or more ResourceManagers may exist at
the same time and the current ResourceManager is not the proper one, so it
rejects or ignores the registration.
2. Check whether a valid TaskExecutor exists at the given address by
connecting to that address. Reject registrations from invalid addresses.
3. Check whether the registration is a duplicate, based on the given
resourceID. If it is, reject the registration.
4. Keep the mapping between resourceID and taskExecutorGateway and, optionally
in YARN mode, the mapping between resourceID and container.
5. Create the connection between the ResourceManager and the TaskExecutor, and
monitor its health via heartbeat RPC calls between the two.
6. Send a registration-successful acknowledgement to the TaskExecutor.
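
The processing steps above could be sketched roughly as follows. This is only
an illustration of the control flow, not the proposed implementation: the
class names (ResourceManagerSketch, TaskExecutorGatewaySketch), the
RegistrationOutcome enum and the addressProbe predicate are hypothetical, the
"connect to the address" check is reduced to a pluggable predicate, and the
heartbeat monitoring of step 5 is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.function.Predicate;

// Hypothetical outcome codes; not part of Flink's API.
enum RegistrationOutcome { SUCCESS, WRONG_LEADER, INVALID_ADDRESS, DUPLICATE }

// Hypothetical stand-in for the RPC gateway obtained by connecting to a TaskExecutor.
final class TaskExecutorGatewaySketch {
    final String address;

    TaskExecutorGatewaySketch(String address) {
        this.address = address;
    }
}

final class ResourceManagerSketch {
    private final UUID leadershipSessionId;
    // Assumption: answers whether a valid TaskExecutor is reachable at an address.
    private final Predicate<String> addressProbe;
    // Step 4: resourceID -> taskExecutorGateway mapping.
    private final Map<String, TaskExecutorGatewaySketch> gateways = new HashMap<>();

    ResourceManagerSketch(UUID leadershipSessionId, Predicate<String> addressProbe) {
        this.leadershipSessionId = leadershipSessionId;
        this.addressProbe = addressProbe;
    }

    RegistrationOutcome registerTaskExecutor(
            UUID resourceManagerLeaderId, String taskExecutorAddress, String resourceID) {
        // Step 1: fencing check against the current leader session.
        if (!leadershipSessionId.equals(resourceManagerLeaderId)) {
            return RegistrationOutcome.WRONG_LEADER;
        }
        // Step 2: reject registrations from addresses where no valid TaskExecutor answers.
        if (!addressProbe.test(taskExecutorAddress)) {
            return RegistrationOutcome.INVALID_ADDRESS;
        }
        // Step 3: reject duplicate registrations for the same resourceID.
        if (gateways.containsKey(resourceID)) {
            return RegistrationOutcome.DUPLICATE;
        }
        // Step 4: remember the gateway for this resourceID.
        gateways.put(resourceID, new TaskExecutorGatewaySketch(taskExecutorAddress));
        // Step 5 (heartbeat monitoring) is omitted from this sketch.
        // Step 6: acknowledge the successful registration.
        return RegistrationOutcome.SUCCESS;
    }

    boolean isRegistered(String resourceID) {
        return gateways.containsKey(resourceID);
    }
}
```

Doing the fencing check first means a stale or competing ResourceManager never
touches its registration table on behalf of a TaskExecutor that believes a
different leader is in charge.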

Discussion:
We may need to introduce an errorCode or several registration-decline
subclasses to distinguish the different causes of a declined registration.
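
As one possible shape for the subclass option, a small response hierarchy
might look like the sketch below. All class names here are hypothetical and
only illustrate the idea; the retry policy in DeclineHandlingSketch is an
assumption made for the example, not something decided in this issue.

```java
// Hypothetical response hierarchy; class names are illustrative, not Flink API.
abstract class RegistrationResponseSketch { }

final class RegistrationSuccessSketch extends RegistrationResponseSketch { }

abstract class RegistrationDeclineSketch extends RegistrationResponseSketch {
    private final String reason;

    RegistrationDeclineSketch(String reason) {
        this.reason = reason;
    }

    String getReason() {
        return reason;
    }
}

// Declined because the fencing token did not match (step 1).
final class LeaderMismatchDecline extends RegistrationDeclineSketch {
    LeaderMismatchDecline() {
        super("resourceManagerLeaderId does not match the current leadershipSessionId");
    }
}

// Declined because no valid TaskExecutor answered at the given address (step 2).
final class InvalidAddressDecline extends RegistrationDeclineSketch {
    InvalidAddressDecline(String address) {
        super("no valid TaskExecutor at address " + address);
    }
}

// Declined because the resourceID is already registered (step 3).
final class DuplicateRegistrationDecline extends RegistrationDeclineSketch {
    DuplicateRegistrationDecline(String resourceID) {
        super("resourceID " + resourceID + " is already registered");
    }
}

final class DeclineHandlingSketch {
    // Assumed policy, for illustration only: a leader mismatch is the one
    // decline cause worth retrying after looking up the leader again.
    static boolean shouldRetryAfterLeaderLookup(RegistrationResponseSketch response) {
        return response instanceof LeaderMismatchDecline;
    }
}
```

Compared with a plain errorCode, distinct subclasses let each decline cause
carry its own data (such as the offending address) at the cost of more types
to maintain.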


> ResourceManager process registration from  TaskExecutor
> -------------------------------------------------------
>
>                 Key: FLINK-4535
>                 URL: https://issues.apache.org/jira/browse/FLINK-4535
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Cluster Management
>            Reporter: zhangjing
>            Assignee: zhangjing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
