[ https://issues.apache.org/jira/browse/FLINK-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009838#comment-16009838 ]

Chesnay Schepler edited comment on FLINK-6440 at 5/14/17 8:23 PM:
------------------------------------------------------------------

I'm wondering what our options are here. We can't simply disable the logging; 
it's possible that only the {{MetricQueryService}} is unreachable, and that 
should be logged when it happens.

We could limit the number of log messages within a given time frame, but that 
would mean an unreachable MQS might only be logged after a long delay.

Finally, we could track the unreachable status of the MQS for each TaskManager, 
e.g. with a set containing the paths. When a request fails, the path is added to 
the set, and we only log something at the moment it is added. Once a request 
succeeds, the path is removed again. The problem is that we would then need some 
time-based clean-up code, as the set could otherwise grow indefinitely when many 
TMs are replaced (and thus never become reachable again).

Sadly there isn't something like a {{TaskmanagerStatusListener}} interface; 
that would be useful for tracking/cleaning up state per {{TaskManager}}.
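The path-set approach above could look roughly like this. This is a minimal sketch, not actual Flink code; the class name, method names, and the choice of a map (path to first-failure timestamp, so eviction is possible) are all made up for illustration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: tracks which MetricQueryService paths are currently
// unreachable so the fetcher logs only the FIRST failure per path, and
// evicts stale entries so the map cannot grow without bound when
// TaskManagers are replaced and never come back.
public class UnreachableMqsTracker {

    // path -> timestamp (ms) when the path was first seen unreachable
    private final Map<String, Long> unreachableSince = new ConcurrentHashMap<>();
    private final long evictAfterMs;

    public UnreachableMqsTracker(long evictAfterMs) {
        this.evictAfterMs = evictAfterMs;
    }

    /** Returns true iff this is the first failure for the path, i.e. it should be logged. */
    public boolean markUnreachable(String path, long nowMs) {
        return unreachableSince.putIfAbsent(path, nowMs) == null;
    }

    /** A successful request clears the entry, so a later failure is logged again. */
    public void markReachable(String path) {
        unreachableSince.remove(path);
    }

    /** Time-based clean-up for paths of TaskManagers that will never return. */
    public void evictStale(long nowMs) {
        unreachableSince.entrySet().removeIf(e -> nowMs - e.getValue() > evictAfterMs);
    }

    public int size() {
        return unreachableSince.size();
    }
}
```

The fetcher would call {{markUnreachable}} on an ask timeout (logging only when it returns true), {{markReachable}} on success, and {{evictStale}} periodically, e.g. from the existing fetch loop.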


> Noisy logs from metric fetcher
> ------------------------------
>
>                 Key: FLINK-6440
>                 URL: https://issues.apache.org/jira/browse/FLINK-6440
>             Project: Flink
>          Issue Type: Bug
>          Components: Webfrontend
>    Affects Versions: 1.3.0
>            Reporter: Stephan Ewen
>            Priority: Critical
>             Fix For: 1.3.0
>
>
> In cases where TaskManagers fail, the web frontend in the Job Manager starts 
> logging the exception below every few seconds.
> I labeled this as critical, because it makes debugging such a situation 
> difficult: the log is flooded with noise.
> {code}
> 2017-05-03 19:37:07,823 WARN  
> org.apache.flink.runtime.webmonitor.metrics.MetricFetcher     - Fetching 
> metrics failed.
> akka.pattern.AskTimeoutException: Ask timed out on 
> [Actor[akka.tcp://flink@herman:52175/user/MetricQueryService_136f717a6b91e248282cb2937d22088c]]
>  after [10000 ms]
>         at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
>         at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
>         at 
> scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>         at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
>         at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
>         at 
> akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
>         at 
> akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
>         at 
> akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
>         at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
