Bhumika Bayani created FLINK-8624:
-------------------------------------
Summary: flink-mesos: The flink rest-api sometimes becomes unresponsive
Key: FLINK-8624
URL: https://issues.apache.org/jira/browse/FLINK-8624
Project: Flink
Issue Type: Bug
Affects Versions: 1.3.2
Reporter: Bhumika Bayani
Sometimes the flink-mesos-scheduler fails or gets killed, and Marathon brings it
up again on another node. In some of these cases we have observed that the REST
API of the newly created Flink instance becomes unresponsive.
Even when we issue API calls manually with curl, such as
http://<host>:<port>/overview or http://<host>:<port>/config,
we do not receive any response.
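To tell a hung REST endpoint apart from an unreachable one without waiting on curl indefinitely, the same two calls can be issued with an explicit timeout. The snippet below is only a minimal sketch: `probe_rest_api` is a hypothetical helper name, the 5-second default timeout is an assumption, and only the `/overview` and `/config` paths come from this report.

```python
import socket
import urllib.error
import urllib.request


def probe_rest_api(base_url, paths=("/overview", "/config"), timeout=5.0):
    """Probe each path and return {path: status}.

    status is "ok", "http <code>", "timeout", or "error: <reason>".
    The helper name and the default timeout are illustrative assumptions.
    """
    results = {}
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                resp.read()  # force the body to be read within the timeout
                if resp.status == 200:
                    results[path] = "ok"
                else:
                    results[path] = "http %d" % resp.status
        except socket.timeout:
            # Connection was accepted but no response arrived in time:
            # this is the "hung" case described in this issue.
            results[path] = "timeout"
        except urllib.error.HTTPError as exc:
            results[path] = "http %d" % exc.code
        except urllib.error.URLError as exc:
            if isinstance(exc.reason, socket.timeout):
                results[path] = "timeout"
            else:
                # For example connection refused: nothing is listening.
                results[path] = "error: %s" % exc.reason
        except OSError as exc:
            results[path] = "error: %s" % exc
    return results
```

Calling `probe_rest_api("http://<host>:<port>")` with the actual host and port filled in returns one status per path. A `timeout` status for both paths would match the behaviour we see, whereas an `error: ...` status would point to the process not listening at all.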
We submit and execute all our Flink jobs through the REST API only. So when the
REST API becomes unresponsive, we cannot run any Flink jobs and no stream
processing happens.
We tried enabling Flink debug logs, but we did not observe anything specific
that indicates why the REST API is failing or unresponsive.
We see the exceptions below in the logs, but they are not specific to the case
where the Flink REST API is hung. We see them in a healthy flink-scheduler too:
{code:java}
Timestamp=2018-02-08 05:43:49,175 LogLevel=INFO ThreadId=[Checkpoint Timer] Class=o.a.f.r.c.CheckpointCoordinator Msg=Triggering checkpoint 10181 @ 1518068629174
Timestamp=2018-02-08 05:43:49,183 LogLevel=DEBUG ThreadId=[nioEventLoopGroup-5-3] Class=o.a.f.r.w.WebRuntimeMonitor Msg=Unhandled exception: {}
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/jobmanager#753807801]] after [10000 ms]
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381) ~[flink-dist_2.11-1.4-SNAPSHOT.jar:1.4-SNAPSHOT]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
{code}
While the REST API is unresponsive, we have observed that the Flink web UI also
does not load or show any information.
Restarting the flink-scheduler sometimes resolves the issue.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)