[ https://issues.apache.org/jira/browse/FLINK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637332#comment-16637332 ]
Jamie Grier commented on FLINK-10484:
-------------------------------------
[~Zentol] Great. I didn't see that this had already been addressed in 1.7.
How difficult do you think it would be to backport the fix to 1.5 and 1.6?
Currently, this is a pretty big problem for people trying to run Flink at any
reasonable scale. Since latency tracking is on by default, basically
everything breaks as soon as you upgrade a job from 1.4 to 1.5. Also, latency
tracking has to be disabled from application code rather than in the
flink-conf.yaml file, so it's very hard for infra teams supporting Flink to
enforce.
It's also not just a problem for the Flink JM. In our case we actually caused
a company-wide observability incident just because of the sheer volume of
metrics being thrown at our metrics servers.
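To give a rough sense of the volume, here is a back-of-the-envelope sketch. It
assumes one latency metric per (source subtask, operator subtask) pair, as
described in the issue below, and that every source subtask's latency markers
reach every operator subtask. The function name and the job shape are
hypothetical, chosen only for illustration, and the result is an upper-bound
estimate rather than a measurement.

```python
def latency_metric_count(source_parallelisms, operator_parallelisms):
    """Estimate the number of latency metrics for one job.

    Assumes one metric per (source subtask, operator subtask) pair,
    which is an upper bound if not every source reaches every operator.
    """
    source_subtasks = sum(source_parallelisms)
    operator_subtasks = sum(operator_parallelisms)
    return source_subtasks * operator_subtasks


# Hypothetical job: 2 sources and 10 operators, all at parallelism 100.
print(latency_metric_count([100] * 2, [100] * 10))
```

The count grows with the product of the subtask totals, so doubling the
parallelism of the whole job roughly quadruples the number of metrics.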
> New latency tracking metrics format causes metrics cardinality explosion
> ------------------------------------------------------------------------
>
> Key: FLINK-10484
> URL: https://issues.apache.org/jira/browse/FLINK-10484
> Project: Flink
> Issue Type: Bug
> Components: Metrics
> Affects Versions: 1.6.0, 1.6.1, 1.5.4
> Reporter: Jamie Grier
> Assignee: Jamie Grier
> Priority: Critical
>
> The new metrics format for latency tracking causes a huge metrics cardinality
> explosion due to the format and the fact that there is a metric reported for
> every combination of source subtask index and operator subtask index.
> Yikes!
> This format is actually responsible for basically taking down our metrics
> system due to DDOSing our metrics servers (at Lyft).
>