[jira] [Commented] (FLINK-16429) failed to restore flink job from checkpoints due to unhandled exceptions

Stephan Ewen (Jira) Thu, 05 Mar 2020 01:27:22 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-16429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051943#comment-17051943
 ]


Stephan Ewen commented on FLINK-16429:
--------------------------------------

Thank you for reporting this.

It might be similar to the issue where creating the ExecutionGraph takes too 
long and the REST handler's ask times out. That can happen, for example, when 
initializing the file system connectors for checkpoints or sources/sinks 
involves blocking calls.

Can you check whether the job actually restores and only the REST handler 
reports the timeout? Or is the restore actually failing? To find out, you could 
check the logs of the master, or check the web UI periodically later, to see if 
a job ends up running after all.
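
The periodic check suggested above could also be scripted against the REST 
endpoint instead of watching the web UI by hand. The following is only a 
minimal sketch, not part of the original comment. It assumes the JobManager 
REST endpoint is reachable at http://localhost:8081 (a placeholder address) and 
that GET /jobs returns a JSON body of the form 
{"jobs": [{"id": ..., "status": ...}]}. The function names, retry count, and 
delay are illustrative choices.

```python
import json
import time
import urllib.request

# Assumption: placeholder address of the JobManager REST endpoint.
BASE_URL = "http://localhost:8081"


def parse_job_states(payload: str) -> dict:
    """Map job id -> status from the JSON body of GET /jobs (assumed shape)."""
    body = json.loads(payload)
    return {job["id"]: job["status"] for job in body.get("jobs", [])}


def fetch_job_states(base_url: str = BASE_URL, timeout_s: float = 30.0) -> dict:
    """Query the REST endpoint once and return the job id -> status mapping."""
    with urllib.request.urlopen(f"{base_url}/jobs", timeout=timeout_s) as resp:
        return parse_job_states(resp.read().decode("utf-8"))


def wait_for_running(fetch=fetch_job_states, attempts=20, delay_s=15.0,
                     sleep=time.sleep):
    """Poll until some job reports RUNNING; return its id, or None if none does."""
    for _ in range(attempts):
        try:
            states = fetch()
        except OSError:
            # Connection errors and socket timeouts: treat as "not ready yet".
            states = {}
        for job_id, status in states.items():
            if status == "RUNNING":
                return job_id
        sleep(delay_s)
    return None
```

Calling wait_for_running() after a submission that reported the 
AskTimeoutException would return a job id if the job came up anyway, and None 
if no job reached RUNNING within the polling window. In the latter case the 
master logs are the next place to look.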

> failed to restore flink job from checkpoints due to unhandled exceptions
> ------------------------------------------------------------------------
>
>                 Key: FLINK-16429
>                 URL: https://issues.apache.org/jira/browse/FLINK-16429
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.1
>            Reporter: Yu Yang
>            Priority: Major
>
> We are trying to restore our Flink job from checkpoints and run into 
> AskTimeoutException-related failures at a high frequency. Our environment is 
> Hadoop 2.7.1 + Yarn + Flink 1.9.1. 
> We hit this issue in 9 out of 10 runs and only occasionally manage to restore 
> the application from the given checkpoints. Since the application can 
> sometimes be restored, the checkpoint files should not be corrupted. The 
> issue seems to be that the JobMaster times out while handling the job 
> submission request.  
>  
> Below is the exception stack trace. It is thrown from
> [https://github.com/apache/flink/blob/2ec645a5bfd3cfadaf0057412401e91da0b21873/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/AbstractHandler.java#L209]
> 2020-03-05 00:04:14,360 ERROR org.apache.flink.runtime.rest.handler.job.JobSubmitHandler - Unhandled exception: httpRequest uri:/v1/jobs, context: ChannelHandlerContext(org.apache.flink.runtime.rest.handler.router.RouterHandler_ROUTED_HANDLER, [id: 0xc39aca33, L:/10.1.85.22:41000 - R:/10.1.16.251:44])
> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-34498396]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
>     at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>     at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
>     at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
>     at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
