[
https://issues.apache.org/jira/browse/FLINK-10521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647690#comment-16647690
]
Till Rohrmann commented on FLINK-10521:
---------------------------------------
I think a faulty {{Metric}} should not bring down the complete metrics system.
Thus, I would consider this a bug. But it would still be a good idea to rename
this issue. There might already be one for this problem. I guess [~Zentol]
might know.
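
To illustrate the kind of isolation meant here, the following is a minimal sketch and not Flink's actual reporter code. It assumes a plain {{Supplier<Number>}} as a stand-in for a metric; the class name {{IsolatedMetricScrape}}, the method {{scrape}}, and the metric names are hypothetical. Each metric is read inside its own try/catch, so a metric that throws is skipped and the remaining ones are still reported.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

public class IsolatedMetricScrape {

    /**
     * Renders one "name value" line per metric. A metric whose supplier
     * throws is logged and skipped instead of failing the whole scrape.
     * Supplier<Number> is an assumed stand-in, not Flink's Metric interface.
     */
    public static String scrape(Map<String, Supplier<Number>> metrics) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Supplier<Number>> entry : metrics.entrySet()) {
            try {
                Number value = entry.getValue().get();
                out.append(entry.getKey()).append(' ').append(value).append('\n');
            } catch (RuntimeException e) {
                // Isolate the failure: report it and continue with the next metric.
                System.err.println("Skipping metric " + entry.getKey() + ": " + e);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical metric names, chosen only for this example.
        Map<String, Supplier<Number>> metrics = new LinkedHashMap<>();
        metrics.put("numRecordsIn", () -> 42L);
        metrics.put("brokenHistogram", () -> {
            throw new IllegalStateException("faulty metric");
        });
        metrics.put("heapUsed", () -> 1024L);

        // The two healthy metrics are printed; the broken one is only logged.
        System.out.print(scrape(metrics));
    }
}
```

With this structure, the cost of a faulty metric is one missing line in the response rather than an empty response.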
> TaskManager metrics are not reported to prometheus after running a job
> ----------------------------------------------------------------------
>
> Key: FLINK-10521
> URL: https://issues.apache.org/jira/browse/FLINK-10521
> Project: Flink
> Issue Type: Bug
> Components: Metrics
> Affects Versions: 1.6.1
> Environment: Flink 1.6.1 cluster with one taskmanager and one
> jobmanager, prometheus and grafana, all started in a local docker environment.
> See sample project at:
> https://github.com/florianschmidt1994/flink-fault-tolerance-baseline
> Reporter: Florian Schmidt
> Priority: Major
> Attachments: Screenshot 2018-10-10 at 11.32.59.png, prometheus.log,
> taskmanager.log
>
>
> Update: This only seems to happen when my custom (admittedly poorly
> implemented) Histogram is enabled. Still I think one poorly implemented
> metric should not bring down the whole metrics system.
> --
> I'm using prometheus to collect the metrics from Flink, and I noticed that
> shortly after running a job, metrics from the taskmanager will stop being
> reported most of the time.
> Looking at the prometheus logs I can see that requests to
> taskmanager:9249/metrics are correct when no job is running, but after
> starting to run a job those requests will return an empty response with
> increasing frequency, until at some point most of the requests are not
> successful anymore. I was able to verify this by running `curl
> localhost:9249/metrics` inside the taskmanager container, where more often
> than not the response was empty, instead of containing the expected metrics.
> In the attached image you can see that occasionally some requests succeed,
> but there are some big gaps in between. Eventually the requests stop
> succeeding completely. The prometheus scrape interval is set to 1s.
> !Screenshot 2018-10-10 at 11.32.59.png!