Yu Yang created FLINK-16429:
-------------------------------
Summary: failed to restore flink job from checkpoints due to
unhandled exceptions
Key: FLINK-16429
URL: https://issues.apache.org/jira/browse/FLINK-16429
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.9.1
Reporter: Yu Yang
We are trying to restore our flink job from check-points, and run into
AskTimeoutException related failures at a high frequency. Our environment is
Hadoop 2.7.1 + Yarn + Flink 1.9.1.
We hit this issue in 9 out of 10 runs, and were able to restore the application
from given check-points from time to time. As the application can be restored,
the check-point files shall not be corrupted. It seems that the issue is that
jobmaster got timeout when it handles job submission request.
Below is the exception stack trace, it is thrown from
[https://github.com/apache/flink/blob/2ec645a5bfd3cfadaf0057412401e91da0b21873/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/AbstractHandler.java#L209]
2020-03-05 00:04:14,360 ERROR
org.apache.flink.runtime.rest.handler.job.JobSubmitHandler - Unhandled
exception: httpRequest uri:/v1/jobs, context:
ChannelHandlerContext(org.apache.flink.runtime.rest.handler.router.RouterHandler_ROUTED_HANDLER,
[id: 0xc39aca33, L:/10.1.85.22:41000 - R:/10.1.16.251:44])
akka.pattern.AskTimeoutException: Ask timed out on
[Actor[akka://flink/user/dispatcher#-34498396]] after [10000 ms]. Message of
type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical
reason for `AskTimeoutException` is that the recipient actor didn't send a
reply. at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) at
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648) at
akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205) at
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
at
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
at
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
at
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
at java.lang.Thread.run(Thread.java:748)
--
This message was sent by Atlassian Jira
(v8.3.4#803005)