[ 
https://issues.apache.org/jira/browse/FLINK-10521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646157#comment-16646157
 ] 

Florian Schmidt commented on FLINK-10521:
-----------------------------------------

[~till.rohrmann] I attached the debug logs. Turning on the debug log level in
Flink shows
{code:java}
2018-10-11 08:47:19,101 DEBUG 
org.apache.flink.runtime.metrics.dump.MetricDumpSerialization - Failed to 
serialize histogram.
java.util.ConcurrentModificationException
... (this one has a stacktrace)
{code}
and
{code:java}
2018-10-11 08:47:07,174 DEBUG 
org.apache.flink.runtime.metrics.dump.MetricDumpSerialization  - Failed to 
serialize histogram.
java.lang.ArrayIndexOutOfBoundsException (this one does not have a stacktrace)
{code}
so it really looks like my sketched-together Histogram implementation is at
fault here.
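
To make the failure mode concrete, here is a minimal sketch (not the actual implementation 
attached to this issue) of the kind of thread-unsafe Histogram that can produce exactly 
these exceptions: the reporter thread reads the backing list via getStatistics() while the 
task thread keeps calling update(). All class names in the sketch are hypothetical; only 
the org.apache.flink.metrics.Histogram / HistogramStatistics interfaces are from Flink.
{code:java}
// Hedged sketch of a thread-unsafe custom Histogram (hypothetical class, not the
// one attached to this issue). A reporter thread iterating the unsynchronized
// ArrayList while the task thread appends to it can fail with
// ConcurrentModificationException or ArrayIndexOutOfBoundsException, which is
// what MetricDumpSerialization then logs as "Failed to serialize histogram."
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.metrics.Histogram;
import org.apache.flink.metrics.HistogramStatistics;

public class NaiveListHistogram implements Histogram {

    // Not thread-safe: written by the task thread, read by the metrics reporter.
    private final List<Long> values = new ArrayList<>();

    @Override
    public void update(long value) {
        values.add(value); // may resize the backing array while a reader iterates
    }

    @Override
    public long getCount() {
        return values.size();
    }

    @Override
    public HistogramStatistics getStatistics() {
        return new HistogramStatistics() {
            @Override
            public long[] getValues() {
                // Iterating the live list while update() runs can throw
                // ConcurrentModificationException.
                return values.stream().mapToLong(Long::longValue).toArray();
            }

            @Override
            public double getQuantile(double quantile) {
                List<Long> snapshot = new ArrayList<>(values); // unsynchronized copy
                snapshot.sort(Long::compare);
                if (snapshot.isEmpty()) {
                    return 0.0;
                }
                int index = (int) Math.ceil(quantile * snapshot.size()) - 1;
                return snapshot.get(Math.max(index, 0));
            }

            @Override
            public int size() {
                return values.size();
            }

            @Override
            public double getMean() {
                return values.stream().mapToLong(Long::longValue).average().orElse(0.0);
            }

            @Override
            public double getStdDev() {
                return 0.0; // omitted in this sketch
            }

            @Override
            public long getMax() {
                return values.stream().mapToLong(Long::longValue).max().orElse(0L);
            }

            @Override
            public long getMin() {
                return values.stream().mapToLong(Long::longValue).min().orElse(0L);
            }
        };
    }
}
{code}
Guarding update() and the snapshot taken in getStatistics() with the same lock (or using a 
thread-safe structure) avoids these exceptions on the metric side.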

 

Would you still consider it a bug in Flink's metrics system that no metrics at all
are reported if one of them is implemented in a way that might throw an exception?
In that case I can update the issue description to reflect what we have found out
so far. Otherwise, if this is expected behaviour, feel free to close this issue.
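
For reference, a minimal sketch of the isolation behaviour I'm asking about (hypothetical 
names, not the actual MetricDumpSerialization code): each metric is serialized inside its 
own try/catch, so a single misbehaving implementation is logged and skipped instead of 
taking down the whole dump.
{code:java}
// Hedged sketch with hypothetical names (this is not the actual
// MetricDumpSerialization code): serialize each metric in its own try/catch so
// one broken metric is skipped and the remaining metrics are still reported.
import java.util.Map;

import org.apache.flink.metrics.Histogram;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TolerantHistogramDump {

    private static final Logger LOG = LoggerFactory.getLogger(TolerantHistogramDump.class);

    /** Placeholder sink for the serialized values; not a Flink API. */
    public interface Output {
        void write(String name, long[] values);
    }

    public void serializeAll(Map<String, Histogram> histograms, Output out) {
        for (Map.Entry<String, Histogram> entry : histograms.entrySet()) {
            try {
                out.write(entry.getKey(), entry.getValue().getStatistics().getValues());
            } catch (Exception e) {
                // A single failing metric is logged and skipped; everything else
                // in the dump is still serialized.
                LOG.debug("Failed to serialize histogram {}.", entry.getKey(), e);
            }
        }
    }
}
{code}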

> TaskManager metrics are not reported to prometheus after running a job
> ----------------------------------------------------------------------
>
>                 Key: FLINK-10521
>                 URL: https://issues.apache.org/jira/browse/FLINK-10521
>             Project: Flink
>          Issue Type: Bug
>          Components: Metrics
>    Affects Versions: 1.6.1
>         Environment: Flink 1.6.1 cluster with one taskmanager and one 
> jobmanager, prometheus and grafana, all started in a local docker environment.
> See sample project at: 
> https://github.com/florianschmidt1994/flink-fault-tolerance-baseline
>            Reporter: Florian Schmidt
>            Priority: Major
>         Attachments: Screenshot 2018-10-10 at 11.32.59.png, prometheus.log, 
> taskmanager.log
>
>
> Update: This only seems to happen when my custom (admittedly poorly
> implemented) Histogram is enabled. Still, I think one poorly implemented
> metric should not bring down the whole metrics system.
> --
> I'm using Prometheus to collect the metrics from Flink, and I noticed that
> shortly after running a job, metrics from the taskmanager stop being reported
> most of the time.
> Looking at the Prometheus logs I can see that requests to
> taskmanager:9249/metrics succeed when no job is running, but after starting
> to run a job those requests return an empty response with increasing
> frequency, until at some point most of the requests are no longer
> successful. I was able to verify this by running `curl
> localhost:9249/metrics` inside the taskmanager container, where more often
> than not the response was empty instead of containing the expected metrics.
> In the attached image you can see that occasionally some requests succeed,
> but there are some big gaps in between. Eventually it stops succeeding
> completely. The Prometheus scrape interval is set to 1s.
> !Screenshot 2018-10-10 at 11.32.59.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
