Frederic Hemery created FLINK-24756:
---------------------------------------
Summary: Flink metric identifiers contain group variables.
Key: FLINK-24756
URL: https://issues.apache.org/jira/browse/FLINK-24756
Project: Flink
Issue Type: Improvement
Components: Runtime / Metrics
Reporter: Frederic Hemery
Metric identifiers are built by concatenating the metric identifier of the
closest {{ComponentMetricGroup}} (which is configurable) with the names of all
the groups that have been added below it.
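As a rough illustration of that concatenation, here is a minimal Java sketch. It is a simplified model, not Flink's actual implementation: the class {{MetricIdentifierSketch}} and its {{identifier}} method are hypothetical names, and the '.' delimiter and the {{flink.operator}} scope are assumptions taken from the metric names in the tables in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified model of how a metric identifier is assembled; not Flink's own classes. */
public class MetricIdentifierSketch {

    /** Joins the component scope, every added group name, and the metric name with '.'. */
    public static String identifier(
            String componentScope, List<String> addedGroups, String metricName) {
        List<String> parts = new ArrayList<>();
        parts.add(componentScope);   // assumed: the configured scope of the component group
        parts.addAll(addedGroups);   // every group added below the component group
        parts.add(metricName);
        return String.join(".", parts);
    }

    public static void main(String[] args) {
        String id = identifier(
                "flink.operator",
                Arrays.asList("KafkaSourceReader", "topic", "resources", "partition", "0"),
                "committedOffset");
        System.out.println(id);
    }
}
```

Under these assumptions, every added group, including the topic and partition values, ends up in the identifier itself.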
This poses a challenge in a monitoring system like Datadog, where it is tricky
to aggregate across metric identifiers. Datadog instead expects a single metric
identifier, with different tags distinguishing the individual timeseries.
Using the Flink Datadog integration, we get:
||Metric Name||Tags||
|flink.operator.KafkaSourceReader.topic.resources.partition.0.committedOffset|[topic:resources,partition:0]|
|flink.operator.KafkaSourceReader.topic.resources.partition.1.committedOffset|[topic:resources,partition:1]|
|flink.operator.KafkaSourceReader.topic.resources.partition.2.committedOffset|[topic:resources,partition:2]|
|flink.operator.KafkaSourceReader.topic.resources.partition.3.committedOffset|[topic:resources,partition:3]|
|flink.operator.KafkaSourceReader.topic.resources.partition.4.committedOffset|[topic:resources,partition:4]|
|flink.operator.KafkaSourceReader.topic.resources.partition.5.committedOffset|[topic:resources,partition:5]|
|flink.operator.KafkaSourceReader.topic.resources.partition.6.committedOffset|[topic:resources,partition:6]|
|...|...|
Instead, the native way to represent metrics in Datadog would be:
||Metric Name||Tags||
|flink.operator.KafkaSourceReader.committedOffset|[topic:resources,partition:0]|
|flink.operator.KafkaSourceReader.committedOffset|[topic:resources,partition:1]|
|flink.operator.KafkaSourceReader.committedOffset|[topic:resources,partition:2]|
|flink.operator.KafkaSourceReader.committedOffset|[topic:resources,partition:3]|
|flink.operator.KafkaSourceReader.committedOffset|[topic:resources,partition:4]|
|flink.operator.KafkaSourceReader.committedOffset|[topic:resources,partition:5]|
|flink.operator.KafkaSourceReader.committedOffset|[topic:resources,partition:6]|
|...|...|
For the same reason, the
[Datadog Docs|https://docs.datadoghq.com/integrations/flink/#metric-collection]
recommend removing all the scope variables from the {{ComponentMetricGroup}}
scope templates.
The metric identifier is built from the scopes, and the tags are built from the
variables. The issue seems to come from groups being part of both the scopes
and the user variables. For user-reported metrics, we can work around this
behavior by creating a custom metric group, but there is no way to do so for
metrics reported by Flink itself (in particular [native
RocksDB|https://github.com/apache/flink/blob/664fdaeaccf910c587f3478dd80bb327b441e85a/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBNativeMetricMonitor.java#L78-L80]
metrics and
[Kafka|https://github.com/apache/flink/blob/99c2a415e9eeefafacf70762b6f54070f7911ceb/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/internals/AbstractFetcher.java#L501-L506]
metrics).
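To make the duplication concrete, here is a toy Java model of a group hierarchy. It is only a sketch under stated assumptions: {{KeyValueGroupSketch}}, {{currentIdentifier}}, {{tagOnlyIdentifier}} and {{tags}} are hypothetical names and not part of Flink, and the model assumes the groups are added in a two-argument key/value form, with the {{flink.operator}} scope taken from the tables above. The {{tagOnlyIdentifier}} method merely illustrates the desired output; it is not a proposed API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a metric group hierarchy; it is not Flink's MetricGroup implementation. */
public class KeyValueGroupSketch {
    private final List<String> scope = new ArrayList<>();
    private final Map<String, String> variables = new LinkedHashMap<>();
    // Hypothetical: the scope with key/value groups left out.
    private final List<String> scopeWithoutVariables = new ArrayList<>();

    public KeyValueGroupSketch(String componentScope) {
        scope.add(componentScope);
        scopeWithoutVariables.add(componentScope);
    }

    /** A plain group only extends the scope. */
    public KeyValueGroupSketch addGroup(String name) {
        scope.add(name);
        scopeWithoutVariables.add(name);
        return this;
    }

    /** A key/value group extends the scope AND registers a variable. */
    public KeyValueGroupSketch addGroup(String key, String value) {
        scope.add(key);
        scope.add(value);
        variables.put(key, value);
        return this;
    }

    /** Identifier as modeled today: key/value groups are part of the name. */
    public String currentIdentifier(String metricName) {
        return String.join(".", scope) + "." + metricName;
    }

    /** Identifier a tag-based backend would prefer: key/value groups omitted. */
    public String tagOnlyIdentifier(String metricName) {
        return String.join(".", scopeWithoutVariables) + "." + metricName;
    }

    public Map<String, String> tags() {
        return variables;
    }

    public static void main(String[] args) {
        KeyValueGroupSketch group = new KeyValueGroupSketch("flink.operator")
                .addGroup("KafkaSourceReader")
                .addGroup("topic", "resources")
                .addGroup("partition", "0");
        System.out.println(group.currentIdentifier("committedOffset"));
        System.out.println(group.tagOnlyIdentifier("committedOffset"));
        System.out.println(group.tags());
    }
}
```

In this model the topic and partition appear both in the current identifier and in the tags, while the tag-only identifier matches the second table above.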
I couldn't think of a simple, clean, and backward-compatible way to make such
a change, though, so I'm looking for feedback.