[jira] [Updated] (FLINK-11215) TaskExecutor RegistrationTimeoutException within the specified maximum registration duration 300000ms

Till Rohrmann (JIRA) Mon, 04 Feb 2019 02:32:12 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Till Rohrmann updated FLINK-11215:
----------------------------------
    Component/s: Distributed Coordination

> TaskExecutor RegistrationTimeoutException within the specified maximum 
> registration duration 300000ms
> -----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-11215
>                 URL: https://issues.apache.org/jira/browse/FLINK-11215
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>            Reporter: Liu
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2018-12-25-14-50-35-348.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Sometimes a job fails after 5 minutes because registration at the resource 
> manager fails.
> !https://wiki.corp.kuaishou.com/download/attachments/113313620/image2018-12-14_20-29-41.png?version=1&modificationDate=1544790582000&api=v2!
> In fact, it had registered successfully 5 minutes earlier (the tag ljg was 
> added by me for testing).
> !image-2018-12-25-14-50-35-348.png!
> The problem occurs because the function startRegistrationTimeout in 
> TaskExecutor.java is called from multiple places.
> In the function start, it is called asynchronously via 
> resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener()). 
> It is also called at the end of the start function. The order of these two 
> calls is not guaranteed, but both modify the same variable, 
> currentRegistrationTimeoutId. If the asynchronous path is fast enough to 
> call startRegistrationTimeout() first, the call at the end of the start 
> function overwrites currentRegistrationTimeoutId afterwards, and the job 
> fails 5 minutes later when that second timeout fires.
> The solution is to call startRegistrationTimeout in the start function 
> before resourceManagerLeaderRetriever.start(). After this change, the 
> problem has not appeared again.
>  
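
The ordering problem described above can be illustrated with a small,
deterministic model. This is only a sketch and not Flink's actual code: the
names startRegistrationTimeout and currentRegistrationTimeoutId come from the
description, while the class RegistrationTimeoutRace and the helpers
stopRegistrationTimeout, timeoutWouldFire, buggyOrdering and fixedOrdering
are hypothetical. It assumes that a successful registration clears the
current timeout id and that a timer only causes a failure if its id is still
the current one.

```java
import java.util.UUID;

// Simplified model of the timeout bookkeeping; not the real TaskExecutor.
class RegistrationTimeoutRace {
    // Plays the role of currentRegistrationTimeoutId from the description.
    private UUID currentRegistrationTimeoutId;

    // Arms a new timeout and returns its id; any earlier id is overwritten.
    UUID startRegistrationTimeout() {
        currentRegistrationTimeoutId = UUID.randomUUID();
        return currentRegistrationTimeoutId;
    }

    // Hypothetical helper: assumed to run once registration has succeeded.
    void stopRegistrationTimeout() {
        currentRegistrationTimeoutId = null;
    }

    // Hypothetical helper: a timer with this id fails the executor only if
    // the id is still the current one when the timer fires.
    boolean timeoutWouldFire(UUID firedId) {
        return firedId.equals(currentRegistrationTimeoutId);
    }

    // Ordering from the report: the asynchronous listener path finishes
    // before the call at the end of start().
    static boolean buggyOrdering() {
        RegistrationTimeoutRace te = new RegistrationTimeoutRace();
        te.startRegistrationTimeout();             // listener path (async)
        te.stopRegistrationTimeout();              // registration succeeded
        UUID late = te.startRegistrationTimeout(); // end of start()
        return te.timeoutWouldFire(late);          // nobody clears this one
    }

    // Proposed ordering: arm the timeout before the leader retriever is
    // started, so a successful registration always clears the latest id.
    static boolean fixedOrdering() {
        RegistrationTimeoutRace te = new RegistrationTimeoutRace();
        UUID first = te.startRegistrationTimeout();  // in start(), up front
        UUID second = te.startRegistrationTimeout(); // listener path (async)
        te.stopRegistrationTimeout();                // registration succeeded
        return te.timeoutWouldFire(first) || te.timeoutWouldFire(second);
    }

    public static void main(String[] args) {
        System.out.println("buggy ordering, timeout fires: " + buggyOrdering());
        System.out.println("fixed ordering, timeout fires: " + fixedOrdering());
    }
}
```

In this model the buggy ordering leaves a timeout armed that no later event
clears, whereas the fixed ordering ends with no timeout armed. Real timers
and threads are deliberately left out so that both orderings can be replayed
step by step.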



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
