[jira] [Updated] (FLINK-16429) failed to restore flink job from checkpoints due to unhandled exceptions

Till Rohrmann (Jira) Fri, 27 Mar 2020 08:15:09 -0700


     [ https://issues.apache.org/jira/browse/FLINK-16429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Till Rohrmann updated FLINK-16429:
----------------------------------
    Description: 
We are trying to restore our Flink job from checkpoints and frequently run into 
failures caused by AskTimeoutException. Our environment is 
Hadoop 2.7.1 + YARN + Flink 1.9.1. 

The issue occurs in about 9 out of 10 runs; only occasionally are we able to 
restore the application from the given checkpoints. Since the application can 
sometimes be restored, the checkpoint files should not be corrupted. It seems 
that the jobmaster times out while handling the job submission request.

 

Below is the exception stack trace. It is thrown from

[https://github.com/apache/flink/blob/2ec645a5bfd3cfadaf0057412401e91da0b21873/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/AbstractHandler.java#L209]

2020-03-05 00:04:14,360 ERROR org.apache.flink.runtime.rest.handler.job.JobSubmitHandler - Unhandled exception: httpRequest uri:/v1/jobs, context: ChannelHandlerContext(org.apache.flink.runtime.rest.handler.router.RouterHandler_ROUTED_HANDLER, [id: 0xc39aca33, L:/10.1.85.22:41000 - R:/10.1.16.251:44])
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-34498396]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
 at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
 at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
 at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
 at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
 at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
 at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
 at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
 at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
 at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
 at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
 at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
 at java.lang.Thread.run(Thread.java:748)

  was:
We are trying to restore our Flink job from checkpoints and frequently run into 
failures caused by AskTimeoutException. Our environment is 
Hadoop 2.7.1 + YARN + Flink 1.9.1. 

The issue occurs in about 9 out of 10 runs; only occasionally are we able to 
restore the application from the given checkpoints. Since the application can 
sometimes be restored, the checkpoint files should not be corrupted. It seems 
that the jobmaster times out while handling the job submission request.

 

Below is the exception stack trace. It is thrown from

[https://github.com/apache/flink/blob/2ec645a5bfd3cfadaf0057412401e91da0b21873/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/AbstractHandler.java#L209]

2020-03-05 00:04:14,360 ERROR org.apache.flink.runtime.rest.handler.job.JobSubmitHandler - Unhandled exception: httpRequest uri:/v1/jobs, context: ChannelHandlerContext(org.apache.flink.runtime.rest.handler.router.RouterHandler_ROUTED_HANDLER, [id: 0xc39aca33, L:/10.1.85.22:41000 - R:/10.1.16.251:44])
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-34498396]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
 at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
 at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
 at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
 at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
 at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
 at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
 at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
 at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
 at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
 at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
 at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
 at java.lang.Thread.run(Thread.java:748)


> failed to restore flink job from checkpoints due to unhandled exceptions
> ------------------------------------------------------------------------
>
>                 Key: FLINK-16429
>                 URL: https://issues.apache.org/jira/browse/FLINK-16429
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.1
>            Reporter: Yu Yang
>            Priority: Major
>
> We are trying to restore our Flink job from checkpoints and frequently run into 
> failures caused by AskTimeoutException. Our environment is 
> Hadoop 2.7.1 + YARN + Flink 1.9.1. 
> The issue occurs in about 9 out of 10 runs; only occasionally are we able to 
> restore the application from the given checkpoints. Since the application can 
> sometimes be restored, the checkpoint files should not be corrupted. It seems 
> that the jobmaster times out while handling the job submission request.
>  
> Below is the exception stack trace. It is thrown from
> [https://github.com/apache/flink/blob/2ec645a5bfd3cfadaf0057412401e91da0b21873/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/AbstractHandler.java#L209]
> 2020-03-05 00:04:14,360 ERROR org.apache.flink.runtime.rest.handler.job.JobSubmitHandler - Unhandled exception: httpRequest uri:/v1/jobs, context: ChannelHandlerContext(org.apache.flink.runtime.rest.handler.router.RouterHandler_ROUTED_HANDLER, [id: 0xc39aca33, L:/10.1.85.22:41000 - R:/10.1.16.251:44])
> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-34498396]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
>  at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>  at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>  at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
>  at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
>  at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>  at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>  at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>  at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
>  at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
>  at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
>  at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
>  at java.lang.Thread.run(Thread.java:748)
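
The failure mode in the trace above (a caller that waits a bounded time for a reply while the recipient is still busy) can be reproduced in miniature. The following is a minimal sketch in plain Java, not Flink or Akka code; the class AskTimeoutSketch and its ask method are hypothetical names introduced only for illustration. It assumes nothing about the actual cause in this report beyond what the AskTimeoutException message itself says.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; this is not Flink or Akka code.
final class AskTimeoutSketch {

    // Sends a "request" to a simulated recipient that needs workMillis to
    // reply, and waits at most timeoutMillis for the answer.
    static String ask(long workMillis, long timeoutMillis) {
        CompletableFuture<String> reply = CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(workMillis); // stands in for slow request handling
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "reply";
        });
        try {
            return reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The recipient is still working; only the caller gave up.
            return "timed out";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } catch (ExecutionException e) {
            return "failed";
        }
    }

    public static void main(String[] args) {
        System.out.println(ask(50, 1000)); // recipient answers within the limit
        System.out.println(ask(1000, 50)); // caller stops waiting first
    }
}
```

In the sketch the slow recipient does not crash; the caller simply stops waiting. If restoring from a checkpoint makes submission handling take longer than the 10000 ms limit shown in the log, the same pattern would apply, although the report does not confirm that this is the cause.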



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
