[
https://issues.apache.org/jira/browse/FLINK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636850#comment-16636850
]
Chesnay Schepler commented on FLINK-10484:
------------------------------------------
In FLINK-10243 we introduced a switch to reduce the amount of data for the
latency source, see
https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#metrics-latency-granularity.
This can be used to drastically reduce the number of latency metrics. We could
look into back-porting this.
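
To give a rough sense of what the granularity switch changes, here is a minimal Python sketch that counts latency histograms under a simplified model. It assumes one histogram is kept per operator subtask for each distinguishable origin, and that the three granularity values are `single`, `operator` and `subtask` as described in the linked configuration page. The function name, parameters and job sizes are hypothetical and only for illustration; real reporters also emit several statistics per histogram, so actual series counts would be higher.

```python
def latency_histogram_count(num_sources, source_parallelism,
                            num_operators, operator_parallelism,
                            granularity):
    """Estimate the number of latency histograms (simplified model, not Flink code)."""
    # Assumption: every operator subtask keeps its own set of histograms.
    operator_subtasks = num_operators * operator_parallelism

    if granularity == "single":
        # No distinction between sources or source subtasks.
        per_operator_subtask = 1
    elif granularity == "operator":
        # One histogram per source, source subtasks merged.
        per_operator_subtask = num_sources
    elif granularity == "subtask":
        # One histogram per source subtask.
        per_operator_subtask = num_sources * source_parallelism
    else:
        raise ValueError("unknown granularity: %r" % granularity)

    return operator_subtasks * per_operator_subtask


if __name__ == "__main__":
    # Hypothetical job: 2 sources and 10 operators, all at parallelism 100.
    for g in ("single", "operator", "subtask"):
        print(g, latency_histogram_count(2, 100, 10, 100, g))
```

Under this model the `subtask` setting grows with the product of source and operator parallelism, which matches the "every combination of source subtask index and operator subtask index" behaviour described in the issue, while `single` grows only with the number of operator subtasks.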
The "cardinality explosion" is caused by introducing proper support for custom
tags, which we used here for consistency purposes as it was always a bit odd
that you only had a tag for the receiving operator ID, but not the source.
The issue of effectively uncontrollable tags (since they're unaffected by scope
formats) was raised before, like in FLINK-7935, but I haven't found time to
address it as it requires a more thorough rework of the internals. All the
tag-based scope goodies were pretty much tacked on after the fact, and now
things are scattered all over the place :(
> New latency tracking metrics format causes metrics cardinality explosion
> ------------------------------------------------------------------------
>
> Key: FLINK-10484
> URL: https://issues.apache.org/jira/browse/FLINK-10484
> Project: Flink
> Issue Type: Bug
> Components: Metrics
> Affects Versions: 1.6.0, 1.6.1, 1.5.4
> Reporter: Jamie Grier
> Assignee: Jamie Grier
> Priority: Critical
>
> The new metrics format for latency tracking causes a huge metrics cardinality
> explosion due to the format and the fact that there is a metric reported for
> every combination of source subtask index and operator subtask index.
> Yikes!
> This format is actually responsible for basically taking down our metrics
> system due to DDOSing our metrics servers (at Lyft).
>