[
https://issues.apache.org/jira/browse/FLINK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann updated FLINK-11215:
----------------------------------
Component/s: Distributed Coordination
> TaskExecutor RegistrationTimeoutException within the specified maximum
> registration duration 300000ms
> -----------------------------------------------------------------------------------------------------
>
> Key: FLINK-11215
> URL: https://issues.apache.org/jira/browse/FLINK-11215
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Reporter: Liu
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2018-12-25-14-50-35-348.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Sometimes a job fails after 5 minutes because the TaskExecutor's registration
> at the resource manager times out.
> !https://wiki.corp.kuaishou.com/download/attachments/113313620/image2018-12-14_20-29-41.png?version=1&modificationDate=1544790582000&api=v2!
> In fact, however, it had registered successfully 5 minutes earlier (the tag
> ljg was added by me for testing).
> !image-2018-12-25-14-50-35-348.png!
> The problem arises because the function startRegistrationTimeout in
> TaskExecutor.java is called from multiple places.
> In the start function, it is called asynchronously via
> resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener()).
> It is also called at the end of the start function. The order of these two
> calls is not guaranteed, but both modify the same variable,
> currentRegistrationTimeoutId. If the asynchronous call is fast enough to
> execute startRegistrationTimeout() first, the job fails 5 minutes later
> because of the startRegistrationTimeout call at the end of the start function.
> The solution is to call startRegistrationTimeout in the start function
> before resourceManagerLeaderRetriever.start(). After this change, the
> problem has not appeared again.
>
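The ordering problem described above can be illustrated with a small, deterministic sketch. This is not Flink's actual code: only the names startRegistrationTimeout and currentRegistrationTimeoutId come from the report, while the class, stopRegistrationTimeout, timeoutFires, and the two scenario methods are hypothetical. It also assumes that a successful registration disarms the current timeout, and that in the failing case the registration completes before the trailing call in start() runs.

```java
import java.util.UUID;

public class RegistrationTimeoutSketch {
    // Plays the role of currentRegistrationTimeoutId from the report.
    private UUID currentRegistrationTimeoutId;

    /** Arms a new timeout and returns its id (a real implementation would also schedule a timer). */
    public UUID startRegistrationTimeout() {
        currentRegistrationTimeoutId = UUID.randomUUID();
        return currentRegistrationTimeoutId;
    }

    /** Hypothetical: called on successful registration, disarms whichever timeout is current. */
    public void stopRegistrationTimeout() {
        currentRegistrationTimeoutId = null;
    }

    /** Hypothetical: simulates a timer firing; only the still-armed timeout causes a failure. */
    public boolean timeoutFires(UUID id) {
        return id != null && id.equals(currentRegistrationTimeoutId);
    }

    /** Reported order: the listener path runs first, the call at the end of start() runs last. */
    public static boolean reportedOrderTimesOut() {
        RegistrationTimeoutSketch te = new RegistrationTimeoutSketch();
        UUID fromListener = te.startRegistrationTimeout(); // asynchronous listener path
        te.stopRegistrationTimeout();                      // registration succeeds (assumption)
        UUID fromStart = te.startRegistrationTimeout();    // end of start() re-arms the timeout
        return te.timeoutFires(fromListener) || te.timeoutFires(fromStart);
    }

    /** Proposed order: start() arms the timeout before the leader retriever is started. */
    public static boolean proposedOrderTimesOut() {
        RegistrationTimeoutSketch te = new RegistrationTimeoutSketch();
        UUID fromStart = te.startRegistrationTimeout();    // moved before retriever start
        UUID fromListener = te.startRegistrationTimeout(); // asynchronous listener path
        te.stopRegistrationTimeout();                      // registration succeeds (assumption)
        return te.timeoutFires(fromStart) || te.timeoutFires(fromListener);
    }

    public static void main(String[] args) {
        System.out.println("reported order times out: " + reportedOrderTimesOut());
        System.out.println("proposed order times out: " + proposedOrderTimesOut());
    }
}
```

In this sketch the reported order leaves a timeout armed that nothing ever cancels, while the proposed order does not, because the last writer of currentRegistrationTimeoutId is always the successful registration.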
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)