[
https://issues.apache.org/jira/browse/FLINK-10521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Florian Schmidt updated FLINK-10521:
------------------------------------
Description:
In my setup I am using the Prometheus reporter and a custom-implemented
histogram metric. After a while the histogram starts throwing exceptions
(because it is rather poorly implemented). This causes all metrics on the
taskmanager where the histogram is running to stop being reported. Looking
at the Prometheus logs you can see that requests to _taskmanager:9249/metrics_
return an empty response when a metric is faulty.
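For illustration, a custom histogram along the following lines would reproduce this: the class name and the failure threshold are hypothetical and not taken from the linked sample project, it just mimics a metric whose getStatistics() starts throwing after a while.
{code:java}
// Hypothetical sketch only -- not the histogram from the linked sample project.
import org.apache.flink.metrics.Histogram;
import org.apache.flink.metrics.HistogramStatistics;

public class FaultyHistogram implements Histogram {

    private long count;

    @Override
    public void update(long value) {
        count++;
    }

    @Override
    public long getCount() {
        return count;
    }

    @Override
    public HistogramStatistics getStatistics() {
        // "After a while" the histogram starts throwing; here simply after 1000 updates.
        if (count > 1000) {
            throw new RuntimeException("poorly implemented histogram");
        }
        return new EmptyStatistics();
    }

    /** Minimal statistics implementation so the sketch is self-contained. */
    private static class EmptyStatistics extends HistogramStatistics {
        @Override public double getQuantile(double quantile) { return 0.0; }
        @Override public long[] getValues() { return new long[0]; }
        @Override public int size() { return 0; }
        @Override public double getMean() { return 0.0; }
        @Override public double getStdDev() { return 0.0; }
        @Override public long getMax() { return 0L; }
        @Override public long getMin() { return 0L; }
    }
}
{code}
Registering something like this in a RichFunction via getRuntimeContext().getMetricGroup().histogram("myHistogram", new FaultyHistogram()) is enough to run into the behaviour described above once the first exception is thrown.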
Expected:
A faulty metric implementation causes only that specific metric to stop being
reported.
Actual:
A faulty metric causes all metrics on that taskmanager to stop being
reported.
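To make the expected behaviour concrete, here is a rough sketch of per-metric isolation while serving a scrape. This is purely illustrative and an assumption on my part, not the actual PrometheusReporter code: a failure while reading one metric is caught so the remaining metrics are still rendered.
{code:java}
// Sketch of the expected per-metric isolation. This is an assumption about how a
// scrape could behave, not the actual PrometheusReporter implementation.
import java.util.Map;
import org.apache.flink.metrics.Histogram;

public final class ScrapeSketch {

    public static String render(Map<String, Histogram> histograms) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Histogram> entry : histograms.entrySet()) {
            try {
                // Reading the statistics is where a faulty implementation throws.
                double p99 = entry.getValue().getStatistics().getQuantile(0.99);
                out.append(entry.getKey()).append("_p99 ").append(p99).append('\n');
            } catch (RuntimeException e) {
                // Skip only the faulty metric; all other metrics are still reported
                // instead of the whole response coming back empty.
            }
        }
        return out.toString();
    }
}
{code}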
was:
Update: This only seems to happen when my custom (admittedly poorly
implemented) Histogram is enabled. Still, I think one poorly implemented metric
should not bring down the whole metrics system.
--
I'm using Prometheus to collect the metrics from Flink, and I noticed that
shortly after running a job, metrics from the taskmanager stop being
reported most of the time.
Looking at the Prometheus logs I can see that requests to
taskmanager:9249/metrics succeed when no job is running, but after starting
a job those requests return an empty response with increasing
frequency, until at some point most of the requests no longer succeed.
I was able to verify this by running `curl localhost:9249/metrics` inside the
taskmanager container, where more often than not the response was empty
instead of containing the expected metrics.
In the attached image you can see that occasionally some requests succeed, but
there are big gaps in between. Eventually requests stop succeeding
entirely. The Prometheus scrape interval is set to 1s.
!Screenshot 2018-10-10 at 11.32.59.png!
Summary: Faulty Histogram stops Prometheus metrics from being reported
(was: TaskManager metrics are not reported to prometheus after running a job)
> Faulty Histogram stops Prometheus metrics from being reported
> -------------------------------------------------------------
>
> Key: FLINK-10521
> URL: https://issues.apache.org/jira/browse/FLINK-10521
> Project: Flink
> Issue Type: Bug
> Components: Metrics
> Affects Versions: 1.6.1
> Environment: Flink 1.6.1 cluster with one taskmanager and one
> jobmanager, prometheus and grafana, all started in a local docker environment.
> See sample project at:
> https://github.com/florianschmidt1994/flink-fault-tolerance-baseline
> Reporter: Florian Schmidt
> Priority: Major
> Attachments: Screenshot 2018-10-10 at 11.32.59.png, prometheus.log,
> taskmanager.log
>
>
> In my setup I am using the Prometheus reporter and a custom-implemented
> histogram metric. After a while the histogram starts throwing exceptions
> (because it is rather poorly implemented). This causes all metrics on the
> taskmanager where the histogram is running to stop being reported. Looking
> at the Prometheus logs you can see that requests to
> _taskmanager:9249/metrics_ return an empty response when a metric is
> faulty.
>
> Expected:
> A faulty metric implementation causes only that specific metric to stop being
> reported.
> Actual:
> A faulty metric causes all metrics on that taskmanager to stop being
> reported.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)