[ 
https://issues.apache.org/jira/browse/FLINK-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283945#comment-14283945
 ] 

Till Rohrmann commented on FLINK-1352:
--------------------------------------

That is a good point. I'll let the JobManager send a RegistrationRefused 
message to the TaskManager, which will then terminate itself.
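
As a rough illustration of the intended behaviour, here is a minimal sketch in 
plain Java. It is not the actual Flink code: the class RegistrationHandler and 
the enums Reply and State are hypothetical names, and treating a refusal as 
final even after a successful registration is an assumption on my side.

```java
/**
 * Hypothetical model of how a TaskManager could react to registration
 * replies. None of these types exist in Flink; they only illustrate the idea.
 */
final class RegistrationHandler {

    /** Replies the JobManager may send (names are assumptions). */
    enum Reply { ACKNOWLEDGED, REFUSED }

    /** Registration state of the TaskManager. */
    enum State { WAITING, REGISTERED, TERMINATED }

    private State state = State.WAITING;

    State state() {
        return state;
    }

    /** Applies one reply from the JobManager and returns the new state. */
    State onReply(Reply reply) {
        if (state == State.TERMINATED) {
            // Assumption: once terminated, later replies are ignored.
            return state;
        }
        switch (reply) {
            case ACKNOWLEDGED:
                state = State.REGISTERED;
                break;
            case REFUSED:
                // A refusal must never be mistaken for a successful registration.
                state = State.TERMINATED;
                break;
        }
        return state;
    }
}
```

The point of the dedicated reply is that the TaskManager no longer has to 
interpret a null instance ID. A refusal is an explicit message with its own 
handling.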

I'm not sure whether the TaskManager should try to connect to the JobManager 
only a limited number of times or indefinitely. The advantage of a limit is 
that, in case of a permanent JobManager outage, TaskManagers do not linger 
around forever unless they are stopped manually. However, there is always the 
possibility that the overall retry window is too short to reach a slow-starting 
JobManager, so some TaskManagers might terminate wrongly. 
Which concern do you think outweighs the other?
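
To make the two options easier to compare, here is a minimal sketch in plain 
Java of a retry schedule that covers both. The class RegistrationBackoff and 
its parameters are hypothetical and not existing Flink code or configuration 
keys. The doubling factor is also just an assumption; the issue only asks for 
an interval that increases from seconds to minutes.

```java
import java.util.OptionalLong;

/**
 * Hypothetical helper (not part of Flink): computes the pause before each
 * registration attempt, doubling it up to a cap.
 */
final class RegistrationBackoff {

    private final long initialMillis;
    private final long maxMillis;
    private final int maxAttempts; // values <= 0 mean "retry indefinitely"

    RegistrationBackoff(long initialMillis, long maxMillis, int maxAttempts) {
        if (initialMillis <= 0 || maxMillis < initialMillis) {
            throw new IllegalArgumentException("need 0 < initialMillis <= maxMillis");
        }
        this.initialMillis = initialMillis;
        this.maxMillis = maxMillis;
        this.maxAttempts = maxAttempts;
    }

    /**
     * Returns the pause before the given attempt (1-based), or an empty
     * result if the attempt limit is exhausted and the caller should give up.
     */
    OptionalLong delayMillis(int attempt) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt is 1-based");
        }
        if (maxAttempts > 0 && attempt > maxAttempts) {
            return OptionalLong.empty();
        }
        long delay = initialMillis;
        for (int i = 1; i < attempt && delay < maxMillis; i++) {
            // Double the pause, guarding against overflow and the cap.
            delay = delay > maxMillis / 2 ? maxMillis : delay * 2;
        }
        return OptionalLong.of(delay);
    }
}
```

With new RegistrationBackoff(500, 4000, 5) the TaskManager would give up after 
the fifth attempt, whereas a maxAttempts of 0 keeps it retrying at the capped 
interval. A generous cap and limit would reduce the risk of giving up on a 
slow-starting JobManager, at the cost of lingering longer during a real outage.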

> Buggy registration from TaskManager to JobManager
> -------------------------------------------------
>
>                 Key: FLINK-1352
>                 URL: https://issues.apache.org/jira/browse/FLINK-1352
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager, TaskManager
>    Affects Versions: 0.9
>            Reporter: Stephan Ewen
>            Assignee: Till Rohrmann
>             Fix For: 0.9
>
>
> The JobManager's InstanceManager may refuse the registration attempt from a 
> TaskManager, because it has this TaskManager already connected or, in the 
> future, because the TaskManager has been blacklisted as unreliable.
> Upon refused registration, the instance ID is null, to signal the refused 
> registration. The TaskManager reacts incorrectly to such messages, assuming 
> successful registration.
> Possible solution: The JobManager sends back a dedicated "RegistrationRefused" 
> message if the instance manager returns null as the registration result. If 
> the TaskManager receives that before being registered, it knows that the 
> registration response was lost (which should not happen on TCP and would 
> indicate a corrupt connection).
> Followup question: Does it make sense to have the TaskManager try 
> indefinitely to connect to the JobManager, with an increasing interval (from 
> seconds to minutes)?



