[
https://issues.apache.org/jira/browse/FLINK-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kai Chen updated FLINK-11708:
-----------------------------
Description:
Currently *TaskExecutor.start()* method starts by connecting to the
ResourceManager and ends with a *startRegistrationTimeout()* method. However,
the *startRegistrationTimeout()* is * *probably* *called after the connection
to ResourceManager is established. Then in time of
"taskmanager.registration.timeout", *registrationTimeout()* is called and throw
a RegistrationTimeoutException bellow, which causes the TaskExecutor shutdown:
2019-02-19 18:00:32,825 ERROR
org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Fatal error occurred
while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
Could not register at the ResourceManager within the specified maximum
registration duration 300000 ms. This indicates a problem with this instance.
Terminating now.
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor$$Lambda$36/321628857.run(Unknown
Source)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
was:
Currently *TaskExecutor.start()* method starts by connecting to the
ResourceManager and ends with a *startRegistrationTimeout()* method. However,
the *startRegistrationTimeout()* is ** probably ** called after the connection
to ResourceManager is established. Then in time of
"taskmanager.registration.timeout", *registrationTimeout()* is called and throw
a RegistrationTimeoutException bellow, which causes the TaskExecutor shutdown:
2019-02-19 18:00:32,825 ERROR
org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Fatal error occurred
while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
Could not register at the ResourceManager within the specified maximum
registration duration 300000 ms. This indicates a problem with this instance.
Terminating now. at
org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor$$Lambda$36/321628857.run(Unknown
Source) at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502) at
akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) at
akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) at
akka.actor.ActorCell.invoke(ActorCell.scala:495) at
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) at
akka.dispatch.Mailbox.run(Mailbox.scala:224) at
akka.dispatch.Mailbox.exec(Mailbox.scala:234) at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> There is a big chance that a taskExecutor starts another RegistrationTimeout
> after successfully registered at RM
> ----------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-11708
> URL: https://issues.apache.org/jira/browse/FLINK-11708
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Affects Versions: 1.6.3, 1.7.2
> Reporter: Kai Chen
> Priority: Major
>
> Currently *TaskExecutor.start()* method starts by connecting to the
> ResourceManager and ends with a *startRegistrationTimeout()* method. However,
> the *startRegistrationTimeout()* is * *probably* *called after the
> connection to ResourceManager is established. Then in time of
> "taskmanager.registration.timeout", *registrationTimeout()* is called and
> throw a RegistrationTimeoutException bellow, which causes the TaskExecutor
> shutdown:
> 2019-02-19 18:00:32,825 ERROR
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Fatal error
> occurred while executing the TaskManager. Shutting it down...
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
> Could not register at the ResourceManager within the specified maximum
> registration duration 300000 ms. This indicates a problem with this instance.
> Terminating now.
> at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
> at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
> at
> org.apache.flink.runtime.taskexecutor.TaskExecutor$$Lambda$36/321628857.run(Unknown
> Source)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
> at
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)