[
https://issues.apache.org/jira/browse/FLINK-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009838#comment-16009838
]
Chesnay Schepler commented on FLINK-6440:
-----------------------------------------
I'm wondering what our options are here. We can't just disable the logging;
there is the possibility that only the {{MetricQueryService}} is unreachable
and this should be logged if that's the case.
We could limit the number of log messages in a given time frame, but this would
mean that an unreachable MQS might only be logged after a long delay.
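To make the rate-limiting option concrete, here is a minimal sketch that permits at most one log message per interval. {{LogRateLimiter}} and {{shouldLog}} are hypothetical names, not existing Flink classes, and the caller is assumed to pass in the current time in milliseconds.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not part of Flink): at most one log message per interval. */
final class LogRateLimiter {
    private final long intervalMillis;
    // Earliest timestamp at which the next message may be logged.
    private final AtomicLong nextAllowed = new AtomicLong(Long.MIN_VALUE);

    LogRateLimiter(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Returns true if the caller may log now; false if the message should be dropped. */
    boolean shouldLog(long nowMillis) {
        long next = nextAllowed.get();
        // compareAndSet ensures that only one thread wins the slot for this interval.
        return nowMillis >= next
                && nextAllowed.compareAndSet(next, nowMillis + intervalMillis);
    }
}
```

With a single shared limiter like this, a failure of a different MQS inside the same window is dropped as well, which is the delay described above.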
Finally, we could track the unreachable status of each MQS, e.g. in a set that
contains the paths. If a request fails, the path is added to the set, and we
only log something when a path is newly added. Once a request succeeds, the
path is removed again. The problem is that we would then need some time-based
clean-up code, as the set could otherwise grow without bound in cases where
many TMs are being replaced (and thus are never reachable again).
Sadly, there is nothing like a {{TaskmanagerStatusListener}} interface; it
would be useful for tracking and cleaning up state per {{TaskManager}}.
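A rough sketch of the set-based tracking with time-based clean-up could look like the following. {{UnreachableServiceTracker}} and its methods are hypothetical names, not existing Flink code. As an assumption, the set is modelled as a map from path to the time of the last failed request, so that entries for replaced TMs that are no longer queried can expire.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of Flink): tracks which query services are unreachable. */
final class UnreachableServiceTracker {
    // Path of the unreachable service -> timestamp of its most recent failed request.
    private final Map<String, Long> unreachable = new ConcurrentHashMap<>();
    private final long ttlMillis;

    UnreachableServiceTracker(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Records a failed request; returns true only if the path was not tracked yet (log then). */
    boolean onFailure(String path, long nowMillis) {
        return unreachable.put(path, nowMillis) == null;
    }

    /** A successful request marks the service as reachable again. */
    void onSuccess(String path) {
        unreachable.remove(path);
    }

    /** Drops entries whose last failure is older than the TTL; returns how many were removed. */
    int cleanUp(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = unreachable.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() > ttlMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    int size() {
        return unreachable.size();
    }
}
```

The fetcher would log only when {{onFailure}} returns true and call {{cleanUp}} periodically. One consequence of this design is that a service which keeps failing after its entry expired would be logged once more.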
> Noisy logs from metric fetcher
> ------------------------------
>
> Key: FLINK-6440
> URL: https://issues.apache.org/jira/browse/FLINK-6440
> Project: Flink
> Issue Type: Bug
> Components: Webfrontend
> Affects Versions: 1.3.0
> Reporter: Stephan Ewen
> Priority: Critical
> Fix For: 1.3.0
>
>
> In cases where TaskManagers fail, the web frontend in the Job Manager starts
> logging the exception below every few seconds.
> I labeled this as critical because the flood of log noise makes debugging in
> such a situation difficult.
> {code}
> 2017-05-03 19:37:07,823 WARN org.apache.flink.runtime.webmonitor.metrics.MetricFetcher - Fetching metrics failed.
> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@herman:52175/user/MetricQueryService_136f717a6b91e248282cb2937d22088c]] after [10000 ms]
> at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
> at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
> at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
> at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
> at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
> at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
> at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
> at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
> at java.lang.Thread.run(Thread.java:745)
> {code}