[ https://issues.apache.org/jira/browse/FLINK-8624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aljoscha Krettek updated FLINK-8624:
------------------------------------
    Fix Version/s: 1.5.0

> flink-mesos: The flink rest-api sometimes becomes unresponsive
> --------------------------------------------------------------
>
>                 Key: FLINK-8624
>                 URL: https://issues.apache.org/jira/browse/FLINK-8624
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, REST
>    Affects Versions: 1.3.2
>            Reporter: Bhumika Bayani
>            Priority: Major
>             Fix For: 1.5.0
>
>
> Sometimes the flink-mesos-scheduler fails or gets killed, and Marathon
> brings it up again on another node. When this happens, we have sometimes
> observed that the REST API of the newly created Flink instance becomes
> unresponsive.
> Even when we issue API calls manually with curl, such as
> http://<host>:<port>/overview or http://<host>:<port>/config,
> we do not receive any response.
> We submit and execute all our Flink jobs through the REST API only, so when
> the REST API becomes unresponsive we cannot run any Flink jobs and no stream
> processing happens.
> We tried enabling Flink debug logs, but we did not observe anything specific
> that indicates why the REST API is failing or unresponsive.
> We see the exceptions below in the logs, but they are not specific to the
> case where the REST API is hung; we see them in a healthy flink-scheduler too:
>  
> {code:java}
> Timestamp=2018-02-08 05:43:49,175 LogLevel=INFO
>         ThreadId=[Checkpoint Timer] Class=o.a.f.r.c.CheckpointCoordinator Msg=Triggering checkpoint 10181 @ 1518068629174
> Timestamp=2018-02-08 05:43:49,183 LogLevel=DEBUG
>         ThreadId=[nioEventLoopGroup-5-3] Class=o.a.f.r.w.WebRuntimeMonitor Msg=Unhandled exception: {}
> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/jobmanager#753807801]] after [10000 ms]
>         at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
> {code}
>  
> While the REST API is unresponsive, we have observed that the Flink web UI
> also does not load or show any information.
> Restarting the flink-scheduler sometimes resolves the issue.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
