Kai Chen created FLINK-11708:
--------------------------------

             Summary: There is a big chance that a TaskExecutor starts another 
RegistrationTimeout after successfully registering at the RM
                 Key: FLINK-11708
                 URL: https://issues.apache.org/jira/browse/FLINK-11708
             Project: Flink
          Issue Type: Bug
          Components: Distributed Coordination
    Affects Versions: 1.7.2, 1.6.3
            Reporter: Kai Chen


Currently the *TaskExecutor.start()* method begins by connecting to the 
ResourceManager and ends by calling *startRegistrationTimeout()*. However, 
*startRegistrationTimeout()* is *probably* called after the connection to the 
ResourceManager has already been established. In that case, once 
"taskmanager.registration.timeout" has elapsed, *registrationTimeout()* is 
called and throws the RegistrationTimeoutException below, which causes the 
TaskExecutor to shut down:

2019-02-19 18:00:32,825 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$$Lambda$36/321628857.run(Unknown Source)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
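
The ordering problem described above can be illustrated with a small toy model. This is a minimal sketch and not the Flink implementation: the class RegistrationTimeoutSketch, its fields, and the three scenario methods are hypothetical names made up for illustration, and the timer is simulated by calling registrationTimeout() directly instead of scheduling it. The two non-failing orderings are shown only as possible remedies, not as the fix adopted by the project.

```java
import java.util.UUID;

/** Toy model of the ordering problem; hypothetical names, not Flink code. */
public class RegistrationTimeoutSketch {

    private UUID currentTimeoutId; // null means no timeout is armed
    private boolean registered;
    private boolean fatalError;

    /** Arms a new timeout and returns its id, as a scheduled timer task would capture it. */
    UUID startRegistrationTimeout() {
        currentTimeoutId = UUID.randomUUID();
        return currentTimeoutId;
    }

    /** Called when registration at the ResourceManager succeeds. */
    void onRegistrationSuccess() {
        registered = true;
        currentTimeoutId = null; // disarm any timeout that is pending
    }

    /** Timer callback: only the currently armed timeout may fail the process. */
    void registrationTimeout(UUID firedId) {
        if (firedId != null && firedId.equals(currentTimeoutId)) {
            fatalError = true;
        }
    }

    /** Problematic order: registration completes before the timeout is armed. */
    public static boolean timeoutArmedAfterRegistration() {
        RegistrationTimeoutSketch te = new RegistrationTimeoutSketch();
        te.onRegistrationSuccess();
        UUID id = te.startRegistrationTimeout(); // nothing will disarm this one
        te.registrationTimeout(id);              // simulated timer expiry
        return te.fatalError;
    }

    /** Possible remedy (assumption): arm the timeout before connecting. */
    public static boolean timeoutArmedBeforeConnecting() {
        RegistrationTimeoutSketch te = new RegistrationTimeoutSketch();
        UUID id = te.startRegistrationTimeout();
        te.onRegistrationSuccess();              // disarms the pending timeout
        te.registrationTimeout(id);              // stale id, ignored
        return te.fatalError;
    }

    /** Possible remedy (assumption): skip arming once already registered. */
    public static boolean timeoutGuardedByRegisteredFlag() {
        RegistrationTimeoutSketch te = new RegistrationTimeoutSketch();
        te.onRegistrationSuccess();
        UUID id = te.registered ? null : te.startRegistrationTimeout();
        te.registrationTimeout(id);              // nothing was armed
        return te.fatalError;
    }

    public static void main(String[] args) {
        System.out.println("armed after registration -> fatal: " + timeoutArmedAfterRegistration());
        System.out.println("armed before connecting  -> fatal: " + timeoutArmedBeforeConnecting());
        System.out.println("guarded by flag          -> fatal: " + timeoutGuardedByRegisteredFlag());
    }
}
```

In this model only the first scenario ends with a fatal error, which matches the shutdown seen in the log above: the timeout is armed after the only event that could disarm it has already happened.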



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
