[ 
https://issues.apache.org/jira/browse/FLINK-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375296#comment-15375296
 ] 

Till Rohrmann commented on FLINK-4152:
--------------------------------------

[~mxm] The restarted registration attempts are only the observable symptom of 
a different problem. 

The actual problem is that the {{YarnFlinkResourceManager}} forgets about the 
registered task managers if the job manager loses its leadership. Each task 
manager has a resource ID with which it registers at the resource manager. The 
{{YarnFlinkResourceManager}} has two states for allocated resources: 
{{containersInLaunch}} and {{registeredWorkers}}. The only possible transition 
is from {{containersInLaunch}} to {{registeredWorkers}}, which works for the 
initial registration. However, when the job manager loses its leadership and 
the {{registeredWorkers}} list is cleared, there is no longer a container in 
launch associated with the respective resource ID. Consequently, when the old 
task manager is re-registered by the new leader, the registration is 
rejected.
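
To make the state handling concrete, here is a minimal sketch of the behaviour 
described above. It is not the actual {{YarnFlinkResourceManager}} code: the 
class name {{ResourceManagerSketch}}, the method {{containerAllocated}} and 
the use of plain strings for resource IDs and containers are assumptions made 
for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the two states; not the real Flink class.
class ResourceManagerSketch {
    private final Map<String, String> containersInLaunch = new HashMap<>();
    private final Map<String, String> registeredWorkers = new HashMap<>();

    // A container was started for the given resource ID (assumed helper).
    void containerAllocated(String resourceId) {
        containersInLaunch.put(resourceId, "container-for-" + resourceId);
    }

    // The only transition: containersInLaunch -> registeredWorkers.
    boolean registerWorker(String resourceId) {
        String container = containersInLaunch.remove(resourceId);
        if (container == null) {
            return false; // no container in launch for this ID: rejected
        }
        registeredWorkers.put(resourceId, container);
        return true;
    }

    // Clearing the map drops all knowledge about running task managers.
    void jobManagerLostLeadership() {
        registeredWorkers.clear();
    }
}
```

In this model, a task manager that registered successfully before the 
leadership change can never register again, because its resource ID is in 
neither map afterwards.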

This rejection is then sent to the task manager. Upon receiving a rejection, 
the task manager schedules another registration attempt after waiting for 
some time. The problem here is that the old registration attempts are not 
cancelled. Consequently, multiple registration attempts run concurrently, 
each with its own retry schedule. That's why you observe so many 
registration attempt messages in the log.

I think the symptom can be fixed by cancelling all currently active 
registration attempts when you want to restart the registration.
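
A rough sketch of what I mean by cancelling, under the assumption that the 
pending attempt is tracked as a {{ScheduledFuture}} on a plain 
{{ScheduledExecutorService}}. The real TaskManager schedules its retries 
through Akka, so the class {{RegistrationRetrySketch}} and the method 
{{restartRegistration}} are purely hypothetical.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; illustrates the idea, not the TaskManager code.
class RegistrationRetrySketch {
    private final ScheduledExecutorService scheduler;
    private ScheduledFuture<?> pendingAttempt; // at most one outstanding attempt

    RegistrationRetrySketch(ScheduledExecutorService scheduler) {
        this.scheduler = scheduler;
    }

    // Cancels any outstanding attempt before scheduling the next one, so
    // restarts never pile up concurrent retry chains.
    synchronized ScheduledFuture<?> restartRegistration(Runnable attempt, long delayMillis) {
        if (pendingAttempt != null) {
            pendingAttempt.cancel(false);
        }
        pendingAttempt = scheduler.schedule(attempt, delayMillis, TimeUnit.MILLISECONDS);
        return pendingAttempt;
    }
}
```

With only one attempt alive at any time, the backoff of that single attempt 
determines the message rate.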

It is a bit unclear to me what the expected behaviour of the 
{{YarnFlinkResourceManager}} should be. In the {{jobManagerLostLeadership}} 
method, where the {{registeredWorkers}} list is cleared, a comment says "all 
currently registered TaskManagers are put under 'awaiting registration'". But 
there is no such state. Furthermore, I'm not sure whether registered 
TaskManagers have to re-register if only the job manager has failed.

Thus, I see two solutions: either do not clear {{registeredWorkers}}, or 
introduce a new state "awaiting registration" which keeps all formerly 
registered task managers so that they can be re-registered.
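
The second option could look roughly like the following sketch. Again, this 
is a simplified model and not the actual resource manager: the class name 
{{AwaitingRegistrationSketch}}, the map {{awaitingRegistration}} and the 
string-based IDs are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model with an additional "awaiting registration" state.
class AwaitingRegistrationSketch {
    private final Map<String, String> containersInLaunch = new HashMap<>();
    private final Map<String, String> registeredWorkers = new HashMap<>();
    private final Map<String, String> awaitingRegistration = new HashMap<>();

    // A container was started for the given resource ID (assumed helper).
    void containerAllocated(String resourceId) {
        containersInLaunch.put(resourceId, "container-for-" + resourceId);
    }

    // Accepts workers from either containersInLaunch or awaitingRegistration.
    boolean registerWorker(String resourceId) {
        String container = containersInLaunch.remove(resourceId);
        if (container == null) {
            container = awaitingRegistration.remove(resourceId);
        }
        if (container == null) {
            return false; // unknown resource ID: rejected
        }
        registeredWorkers.put(resourceId, container);
        return true;
    }

    // Formerly registered workers are remembered instead of forgotten.
    void jobManagerLostLeadership() {
        awaitingRegistration.putAll(registeredWorkers);
        registeredWorkers.clear();
    }
}
```

Here a task manager that was registered before the leadership change is 
accepted again by the new leader, while an unknown resource ID is still 
rejected.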

Maybe [~mxm] can give some input.

> TaskManager registration exponential backoff doesn't work
> ---------------------------------------------------------
>
>                 Key: FLINK-4152
>                 URL: https://issues.apache.org/jira/browse/FLINK-4152
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, TaskManager, YARN Client
>            Reporter: Robert Metzger
>            Assignee: Till Rohrmann
>         Attachments: logs.tgz
>
>
> While testing Flink 1.1 I've found that the TaskManagers are logging many 
> messages when registering at the JobManager.
> This is the log file: 
> https://gist.github.com/rmetzger/0cebe0419cdef4507b1e8a42e33ef294
> It's logging more than 3000 messages in less than a minute. I don't think that 
> this is the expected behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
