[ 
https://issues.apache.org/jira/browse/FLINK-10521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florian Schmidt updated FLINK-10521:
------------------------------------
    Description: 
In my setup I am using the Prometheus reporter and a custom-implemented 
histogram metric. After a while the histogram starts throwing exceptions 
(because it is rather poorly implemented). This causes all metrics on the 
taskmanager where the histogram is running to stop being reported. Looking 
at the Prometheus logs, you can see that requests to _taskmanager:9249/metrics_ 
return an empty response whenever a metric is faulty.

 

Expected:

A faulty metric implementation causes only that specific metric to stop being 
reported

Actual:

A faulty metric causes all metrics on that taskmanager to stop being 
reported
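
For illustration, here is a minimal sketch of how such a faulty histogram can be reproduced. The class and metric names below are purely illustrative and are not the actual implementation from this setup; it simply shows a custom Histogram whose getStatistics() starts throwing after a number of updates, which is enough to trigger the reported behavior once it is registered on an operator.

{code:java}
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Histogram;
import org.apache.flink.metrics.HistogramStatistics;

// Illustrative reproduction only: a deliberately broken Histogram registered on an operator.
public class FaultyHistogramMap extends RichMapFunction<Long, Long> {

    private transient Histogram histogram;

    @Override
    public void open(Configuration parameters) {
        // Register the custom histogram with the operator's metric group.
        histogram = getRuntimeContext()
                .getMetricGroup()
                .histogram("faultyHistogram", new FaultyHistogram());
    }

    @Override
    public Long map(Long value) {
        histogram.update(value);
        return value;
    }

    /** Deliberately broken: getStatistics() throws once more than 100 values were recorded. */
    private static class FaultyHistogram implements Histogram {
        private long count;

        @Override
        public void update(long value) {
            count++;
        }

        @Override
        public long getCount() {
            return count;
        }

        @Override
        public HistogramStatistics getStatistics() {
            if (count > 100) {
                // Simulates the "rather poorly implemented" histogram described above.
                throw new IllegalStateException("faulty histogram");
            }
            return new EmptyStatistics();
        }
    }

    /** Minimal statistics object returning neutral values while the histogram still works. */
    private static class EmptyStatistics extends HistogramStatistics {
        @Override public double getQuantile(double quantile) { return 0.0; }
        @Override public long[] getValues() { return new long[0]; }
        @Override public int size() { return 0; }
        @Override public double getMean() { return 0.0; }
        @Override public double getStdDev() { return 0.0; }
        @Override public long getMin() { return 0; }
        @Override public long getMax() { return 0; }
    }
}
{code}

With a sketch like this running, requests to _taskmanager:9249/metrics_ start returning empty responses once the exception is thrown, instead of only omitting the broken histogram.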

  was:
Update: This only seems to happen when my custom (admittedly poorly 
implemented) Histogram is enabled. Still, I think one poorly implemented metric 
should not bring down the whole metrics system.

--

I'm using Prometheus to collect the metrics from Flink, and I noticed that 
shortly after running a job, metrics from the taskmanager stop being reported 
most of the time.

Looking at the Prometheus logs I can see that requests to 
taskmanager:9249/metrics succeed when no job is running, but after starting 
a job those requests return an empty response with increasing frequency, 
until at some point most of the requests no longer succeed. 
I was able to verify this by running `curl localhost:9249/metrics` inside the 
taskmanager container, where more often than not the response was empty 
instead of containing the expected metrics.

In the attached image you can see that occasionally some requests succeed, but 
there are large gaps in between. Eventually they stop succeeding 
completely. The Prometheus scrape interval is set to 1s.

!Screenshot 2018-10-10 at 11.32.59.png!

        Summary: Faulty Histogram stops Prometheus metrics from being reported  
(was: TaskManager metrics are not reported to prometheus after running a job)

> Faulty Histogram stops Prometheus metrics from being reported
> -------------------------------------------------------------
>
>                 Key: FLINK-10521
>                 URL: https://issues.apache.org/jira/browse/FLINK-10521
>             Project: Flink
>          Issue Type: Bug
>          Components: Metrics
>    Affects Versions: 1.6.1
>         Environment: Flink 1.6.1 cluster with one taskmanager and one 
> jobmanager, prometheus and grafana, all started in a local docker environment.
> See sample project at: 
> https://github.com/florianschmidt1994/flink-fault-tolerance-baseline
>            Reporter: Florian Schmidt
>            Priority: Major
>         Attachments: Screenshot 2018-10-10 at 11.32.59.png, prometheus.log, 
> taskmanager.log
>
>
> In my setup I am using the Prometheus reporter and a custom-implemented 
> histogram metric. After a while the histogram starts throwing exceptions 
> (because it is rather poorly implemented). This causes all metrics on the 
> taskmanager where the histogram is running to stop being reported. Looking 
> at the Prometheus logs, you can see that requests to 
> _taskmanager:9249/metrics_ return an empty response whenever a metric is 
> faulty.
>  
> Expected:
> A faulty metric implementation causes only that specific metric to stop being 
> reported
> Actual:
> A faulty metric causes all metrics on that taskmanager to stop being 
> reported



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
