[ https://issues.apache.org/jira/browse/FLINK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-11215:
-----------------------------------
    Labels: pull-request-available  (was: )

> TaskExecutor RegistrationTimeoutException within the specified maximum registration duration 300000ms
> -----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-11215
>                 URL: https://issues.apache.org/jira/browse/FLINK-11215
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Liu
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2018-12-25-14-50-35-348.png
>
>
> Sometimes a job fails after 5 minutes because the TaskExecutor's
> registration at the resource manager is reported as failed.
> !https://wiki.corp.kuaishou.com/download/attachments/113313620/image2018-12-14_20-29-41.png?version=1&modificationDate=1544790582000&api=v2!
> In fact, it had registered successfully 5 minutes earlier (the tag ljg was
> added by me for testing).
> !image-2018-12-25-14-50-35-348.png!
> The problem occurs because startRegistrationTimeout in TaskExecutor.java is
> called from multiple places.
> In start(), it is called asynchronously via
> resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener()),
> and it is called again at the end of start(). The order of these two calls
> is not guaranteed, yet both overwrite the same variable,
> currentRegistrationTimeoutId. If the asynchronous path is fast enough to run
> startRegistrationTimeout() first, the job fails 5 minutes later because of
> the timeout started by the call at the end of start().
> The solution is to call startRegistrationTimeout in start() before
> resourceManagerLeaderRetriever.start(). With this change, the problem has
> not appeared again.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
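
The ordering problem described in the report can be illustrated with a small, deterministic Java sketch. It is not Flink code: TimeoutTracker, wouldFail, and the two scenario methods are hypothetical names chosen for illustration. The sketch also assumes two things the report does not state explicitly: that a successful registration clears currentRegistrationTimeoutId, and that an elapsed timeout only fails the TaskExecutor if its id is still the current one.

```java
import java.util.UUID;

public class RegistrationTimeoutSketch {

    /** Hypothetical stand-in for the timeout bookkeeping in TaskExecutor. */
    static final class TimeoutTracker {
        private UUID currentRegistrationTimeoutId;

        /** Arms a new timeout and returns its id; overwrites any earlier id. */
        UUID startRegistrationTimeout() {
            currentRegistrationTimeoutId = UUID.randomUUID();
            return currentRegistrationTimeoutId;
        }

        /** Assumption: called once registration at the resource manager succeeds. */
        void stopRegistrationTimeout() {
            currentRegistrationTimeoutId = null;
        }

        /** Assumption: an elapsed timeout only fails if its id is still current. */
        boolean wouldFail(UUID timeoutId) {
            return timeoutId.equals(currentRegistrationTimeoutId);
        }
    }

    /** Ordering from the report: the asynchronous listener path runs first. */
    public static boolean failsWhenListenerRunsFirst() {
        TimeoutTracker tracker = new TimeoutTracker();
        UUID fromListener = tracker.startRegistrationTimeout(); // async leader callback
        tracker.stopRegistrationTimeout();                      // registration succeeds
        UUID fromStart = tracker.startRegistrationTimeout();    // end of start()
        // Nothing clears fromStart afterwards, so it is still current when it elapses.
        return tracker.wouldFail(fromListener) || tracker.wouldFail(fromStart);
    }

    /** Proposed ordering: arm the timeout before the leader retriever is started. */
    public static boolean failsWhenTimeoutStartedFirst() {
        TimeoutTracker tracker = new TimeoutTracker();
        UUID fromStart = tracker.startRegistrationTimeout();    // first step in start()
        UUID fromListener = tracker.startRegistrationTimeout(); // async leader callback
        tracker.stopRegistrationTimeout();                      // registration succeeds
        return tracker.wouldFail(fromStart) || tracker.wouldFail(fromListener);
    }

    public static void main(String[] args) {
        System.out.println("listener first: " + failsWhenListenerRunsFirst());
        System.out.println("timeout first:  " + failsWhenTimeoutStartedFirst());
    }
}
```

Under these assumptions, the first scenario leaves a timeout armed after registration has already succeeded, while the second does not, which matches the behaviour the reporter observed before and after moving the call.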