[
https://issues.apache.org/jira/browse/FLINK-21309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-21309:
-----------------------------------
Labels: auto-deprioritized-major stale-minor (was:
auto-deprioritized-major)
I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help
the community manage its development. I see this issue has been marked as
Minor but is unassigned and neither itself nor its Sub-Tasks have been updated
for 180 days. I have gone ahead and marked it "stale-minor". If this ticket is
still Minor, please either assign yourself or give an update. Afterwards,
please remove the label or in 7 days the issue will be deprioritized.
> Metrics of JobManager and TaskManager overwrite each other in pushgateway
> -------------------------------------------------------------------------
>
> Key: FLINK-21309
> URL: https://issues.apache.org/jira/browse/FLINK-21309
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Metrics
> Affects Versions: 1.9.0, 1.10.0, 1.11.0
> Environment: 1. Components :
> Flink 1.9.0/1.10.0/1.11.0 + Prometheus + Pushgateway + Yarn
> 2. Metrics Configuration in flink-conf.yaml :
> {code:java}
> metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
> metrics.reporter.promgateway.jobName: myjob
> metrics.reporter.promgateway.randomJobNameSuffix: false{code}
>
> Reporter: jiguodai
> Priority: Minor
> Labels: auto-deprioritized-major, stale-minor
> Attachments: image-2021-02-05-21-07-42-292.png
>
> Original Estimate: 12h
> Remaining Estimate: 12h
>
> When a Flink job runs on YARN, the metrics of the JobManager and the
> TaskManagers overwrite each other. In the Pushgateway web UI, only JobManager
> metrics are visible one second and only TaskManager metrics the next; the
> two kinds of metrics appear alternately. As a result, a TaskManager metric in
> Grafana shows intermittent gaps like the one below (the metric disappears
> whenever JobManager metrics overwrite the TaskManager metrics):
> !image-2021-02-05-21-07-42-292.png!
> The root cause is that Flink's PrometheusPushGatewayReporter uses PUT
> instead of POST to push metrics to the Pushgateway. In addition, the
> TaskManagers and the JobManager share the same jobName (the only grouping
> key), which is the one configured in flink-conf.yaml.
> Although the REST URL is the same for both methods,
> {code:java}
> /metrics/job/<JOB_NAME>{/<LABEL_NAME>/<LABEL_VALUE>}
> {code}
> PUT and POST produce different results:
> * PUT is used to push a group of metrics. All metrics with the grouping key
> specified in the URL are replaced by the metrics pushed with PUT.
> * POST works exactly like the PUT method but only metrics with the same name
> as the newly pushed metrics are replaced.
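The difference between the two methods can be sketched with a toy in-memory model. This is only an illustration of the semantics quoted above, not the Pushgateway implementation: the class `PushgatewayModel`, the helper `visible_metrics`, and the metric names are hypothetical, and the job name is assumed to be the only grouping key.

```python
class PushgatewayModel:
    """Toy in-memory stand-in for a Pushgateway (hypothetical, not the real server)."""

    def __init__(self):
        # grouping key (here just the job name) -> {metric name: value}
        self.groups = {}

    def put(self, job, metrics):
        # PUT: the whole group is replaced by the pushed metrics.
        self.groups[job] = dict(metrics)

    def post(self, job, metrics):
        # POST: only metrics with the same name as a pushed metric are replaced.
        self.groups.setdefault(job, {}).update(metrics)


def visible_metrics(method):
    """Push JobManager metrics, then TaskManager metrics, under one job name."""
    gateway = PushgatewayModel()
    push = getattr(gateway, method)
    push("myjob", {"jobmanager_metric": 1.0})   # hypothetical metric names
    push("myjob", {"taskmanager_metric": 2.0})
    return sorted(gateway.groups["myjob"])


print(visible_metrics("put"))   # only the TaskManager metric survives
print(visible_metrics("post"))  # both metrics remain
```

With PUT, whichever process pushed last owns the whole group, which matches the alternating behaviour described in this ticket.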
> For these reasons, it is better to push metrics to the Pushgateway with POST,
> so that JobManager and TaskManager metrics do not overwrite each other and
> the graphs in Grafana are continuous. One could argue that we can instead set
> {code:java}
> metrics.reporter.promgateway.randomJobNameSuffix: true{code}
> in flink-conf.yaml, so that the jobName on each node gets a random suffix
> and the metrics no longer overwrite each other. However, most users filter
> on jobName in PromQL, and having to use regular expressions to match the
> jobName slows down data retrieval in Prometheus.
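A rough sketch of why the random suffix forces regex matching. The suffix format (a random hex string appended to the configured name) and the helper `pushed_job_name` are assumptions for illustration only; the PromQL selectors in the comments use `job` as a placeholder for whichever label carries the pushed job name.

```python
import re
import uuid


def pushed_job_name(base, random_suffix):
    # Hypothetical helper: assume the suffix is a random hex string.
    return base + uuid.uuid4().hex if random_suffix else base


# One JobManager and two TaskManagers, each with its own suffixed name.
names = [pushed_job_name("myjob", random_suffix=True) for _ in range(3)]

# Equality matcher, e.g. {job="myjob"}: no longer matches anything.
exact_matches = [n for n in names if n == "myjob"]

# Regex matcher, e.g. {job=~"myjob.+"}: needed to find the suffixed names.
regex_matches = [n for n in names if re.fullmatch(r"myjob.+", n)]

print(len(exact_matches), len(regex_matches))
```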
> Every time somebody asks on the Flink mailing list why metrics in Grafana
> are discontinuous, I tell them to change the push method from PUT to POST
> and then repackage the flink-metrics-prometheus module. So why don't we fix
> the problem permanently now? I would sincerely like the chance to fix it.
> Related links:
> [https://github.com/prometheus/pushgateway#put-method]
> [https://github.com/prometheus/pushgateway/issues/308]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)