[ 
https://issues.apache.org/jira/browse/FLINK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu updated FLINK-11215:
------------------------
    Description: 
Sometimes a job fails after 5 minutes because registration at the resource 
manager fails.

!https://wiki.corp.kuaishou.com/download/attachments/113313620/image2018-12-14_20-29-41.png?version=1&modificationDate=1544790582000&api=v2!

In fact, the registration had already succeeded 5 minutes earlier (the tag ljg 
was added by me for testing).

!image-2018-12-25-14-50-35-348.png!

The problem arises because the function startRegistrationTimeout in 
TaskExecutor.java is called from multiple places.

In the start function, it is called asynchronously via 
resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener()). It 
is also called at the end of the start function. The order of these two calls 
is not guaranteed, yet both modify the same variable, 
currentRegistrationTimeoutId. If the asynchronous path is fast enough to call 
startRegistrationTimeout() first, the job fails 5 minutes later because of the 
startRegistrationTimeout call at the end of the start function.
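
The interleaving described above can be replayed deterministically. The 
following is a minimal single-threaded sketch, not Flink's actual code. It 
assumes that each call to startRegistrationTimeout stores a fresh token in 
currentRegistrationTimeoutId, that a successful registration clears the token 
(stopRegistrationTimeout is an assumed name), and that a scheduled timeout only 
fails the job if its token is still the current one.

```java
import java.util.UUID;

/** Simplified, single-threaded model of the timeout token; not Flink's actual code. */
final class RegistrationTimeoutRace {
    // Mirrors the shared variable named in the report.
    private UUID currentRegistrationTimeoutId;

    /** Arms a new timeout and returns its token. */
    UUID startRegistrationTimeout() {
        UUID id = UUID.randomUUID();
        currentRegistrationTimeoutId = id;
        return id;
    }

    /** Assumed counterpart that disarms the timeout once registration succeeds. */
    void stopRegistrationTimeout() {
        currentRegistrationTimeoutId = null;
    }

    /** A scheduled timeout only fails the job if its token is still current. */
    boolean timeoutWouldFire(UUID id) {
        return id.equals(currentRegistrationTimeoutId);
    }

    /** Replays the problematic interleaving as a fixed sequence of steps. */
    static boolean problematicInterleavingFails() {
        RegistrationTimeoutRace executor = new RegistrationTimeoutRace();
        UUID fromListener = executor.startRegistrationTimeout(); // async listener runs first
        executor.stopRegistrationTimeout();                      // registration succeeds
        UUID fromStart = executor.startRegistrationTimeout();    // trailing call in start()
        // Five minutes later both scheduled checks run; nothing disarms the second token.
        return executor.timeoutWouldFire(fromListener) || executor.timeoutWouldFire(fromStart);
    }

    public static void main(String[] args) {
        System.out.println("job fails: " + problematicInterleavingFails());
    }
}
```

Under these assumptions the second token is never cleared, so the sketch 
reports a failure even though registration succeeded.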

The solution is to call startRegistrationTimeout in the start function before 
resourceManagerLeaderRetriever.start(). With this change, the problem has not 
appeared again.
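
The effect of the reordering can be sketched as follows. This is a hypothetical 
model rather than the real TaskExecutor: LeaderRetriever, onLeaderFound, and 
startLeavesTimeoutArmed are illustrative names, and the retriever here invokes 
its listener immediately to stand in for an asynchronous path that completes 
before start returns.

```java
import java.util.UUID;

/** Hypothetical model of the start() ordering; not Flink's actual code. */
final class StartOrdering {
    /** Stand-in for the leader retriever (assumed shape). */
    interface LeaderRetriever {
        void start(Runnable listener);
    }

    private UUID currentRegistrationTimeoutId;

    private void startRegistrationTimeout() {
        currentRegistrationTimeoutId = UUID.randomUUID();
    }

    // Listener path: arm the timeout, then register successfully and disarm it.
    private void onLeaderFound() {
        startRegistrationTimeout();
        currentRegistrationTimeoutId = null;
    }

    /** Runs start() with the timeout armed either before the retriever or at the end. */
    boolean startLeavesTimeoutArmed(LeaderRetriever retriever, boolean armBeforeRetriever) {
        if (armBeforeRetriever) {
            startRegistrationTimeout();
        }
        retriever.start(this::onLeaderFound);
        if (!armBeforeRetriever) {
            startRegistrationTimeout();
        }
        // A token that is still set here would expire and fail the job later.
        return currentRegistrationTimeoutId != null;
    }

    public static void main(String[] args) {
        LeaderRetriever fastRetriever = listener -> listener.run();
        // Original ordering: the timeout is armed last and stays armed.
        System.out.println(new StartOrdering().startLeavesTimeoutArmed(fastRetriever, false));
        // Proposed ordering: the timeout is armed first, so registration disarms it.
        System.out.println(new StartOrdering().startLeavesTimeoutArmed(fastRetriever, true));
    }
}
```

In this model the original ordering leaves a timeout armed after a successful 
registration, while the proposed ordering does not.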

 

  was:
Sometimes a job fails after 5 minutes because registration at the resource 
manager fails.

!https://wiki.corp.kuaishou.com/download/attachments/113313620/image2018-12-14_20-29-41.png?version=1&modificationDate=1544790582000&api=v2!

In fact, the registration had already succeeded 5 minutes earlier (the tag ljg 
was added by me for testing).

!image-2018-12-25-14-50-35-348.png!

The problem arises because the function startRegistrationTimeout in 
TaskExecutor.java is called from multiple places.

In the start function, it is called asynchronously via 
resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener()). It 
is also called at the end of the start function. The order of these two calls 
is not guaranteed, yet both modify the same variable, 
currentRegistrationTimeoutId. If the asynchronous path is fast enough to call 
startRegistrationTimeout() first, the job fails 5 minutes later because of the 
startRegistrationTimeout call at the end of the start function.

The solution is to call startRegistrationTimeout at the beginning of the start 
function. With this change, the problem has not appeared again.

 


> TaskExecutor RegistrationTimeoutException within the specified maximum 
> registration duration 300000ms
> -----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-11215
>                 URL: https://issues.apache.org/jira/browse/FLINK-11215
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Liu
>            Priority: Major
>         Attachments: image-2018-12-25-14-50-35-348.png
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)