[ https://issues.apache.org/jira/browse/FLINK-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009838#comment-16009838 ]
Chesnay Schepler commented on FLINK-6440:
-----------------------------------------

I'm wondering what our options are here.

We can't just disable the logging; it is possible that only the {{MetricQueryService}} is unreachable, and that should be logged if it's the case.

We could limit the number of log messages in a given time frame, but this would mean that an unreachable MQS may only be logged after a long time.

Finally, we could track the unreachable status of the MQS, for example with a set that contains the paths. If a request fails, the path is added to the set, and we only log something when it is newly added. Once a request succeeds, the path is removed again. The problem is that we would then need some time-based clean-up code, as the set could otherwise grow without bound in cases where many TMs are being replaced (and thus are never reachable again).

Sadly there isn't something like a {{TaskmanagerStatusListener}} interface; it would be useful for tracking and cleaning up state per {{TaskManager}}.

> Noisy logs from metric fetcher
> ------------------------------
>
>                 Key: FLINK-6440
>                 URL: https://issues.apache.org/jira/browse/FLINK-6440
>             Project: Flink
>          Issue Type: Bug
>          Components: Webfrontend
>    Affects Versions: 1.3.0
>            Reporter: Stephan Ewen
>            Priority: Critical
>             Fix For: 1.3.0
>
>
> In cases where TaskManagers fail, the web frontend in the Job Manager starts logging the exception below every few seconds.
> I labeled this as critical, because it actually makes debugging in such a situation complicated through a log that is flooded with noise.
> {code}
> 2017-05-03 19:37:07,823 WARN org.apache.flink.runtime.webmonitor.metrics.MetricFetcher - Fetching metrics failed.
> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@herman:52175/user/MetricQueryService_136f717a6b91e248282cb2937d22088c]] after [10000 ms]
>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
>     at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
>     at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
>     at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
>     at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
>     at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
>     at java.lang.Thread.run(Thread.java:745)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
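
For illustration, the set-based option from the comment (log only when a path first becomes unreachable, forget it on success, and clean up stale entries by time) might look roughly like the sketch below. This is only a sketch under assumptions: {{UnreachableTracker}} and all of its method names are hypothetical and not part of Flink, and the caller is assumed to pass in the current time and choose the maximum age.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of Flink) that tracks unreachable MetricQueryService paths. */
final class UnreachableTracker {

    // Maps a query service path to the time (ms) of its most recent failed request.
    private final Map<String, Long> unreachable = new ConcurrentHashMap<>();

    /**
     * Records a failed request. Returns true only if the path was not already
     * marked unreachable, i.e. only when the caller should write a log message.
     */
    boolean recordFailure(String path, long nowMillis) {
        return unreachable.put(path, nowMillis) == null;
    }

    /** Records a successful request; the next failure for this path is logged again. */
    void recordSuccess(String path) {
        unreachable.remove(path);
    }

    /**
     * Time-based clean-up: drops entries whose last failure is older than
     * maxAgeMillis, so paths of replaced TaskManagers do not accumulate.
     */
    void evictOlderThan(long nowMillis, long maxAgeMillis) {
        unreachable.values().removeIf(lastFailure -> nowMillis - lastFailure > maxAgeMillis);
    }

    /** Number of paths currently marked unreachable. */
    int size() {
        return unreachable.size();
    }
}
```

A fetcher would call {{recordFailure}} in its failure callback and log only when it returns true, call {{recordSuccess}} on every successful response, and run {{evictOlderThan}} periodically. Storing the last failure time rather than the first means an entry only ages out once the fetcher has stopped querying that path.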