[jira] [Commented] (FLINK-11912) Expose per partition Kafka lag metric in Flink Kafka connector

Rong Rong (JIRA) Mon, 18 Mar 2019 08:28:11 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-11912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795108#comment-16795108
 ]


Rong Rong commented on FLINK-11912:
-----------------------------------

+1. I think this metrics will be extremely helpful in production to detect any 
problematic data / user logic.

On a higher level, several questions: 1) do we want to keep 
{{manualRegisteredMetricSet}} as a norm (e.g. in other connectors as well) in 
order to keep up with the changing connector library in the future? 2) what if 
some of the other metrics that might be useful in the future, do we add them to 
the {{manualRegisteredMetricsSet}} (e.g. in the producer thread as well?) or 
should we keep a global copy?

Another thing is do we want to align with the 
[https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics]
 effort? seems like a "per-partition" metrics is always useful across different 
connectors, although the definition of "partition" might be different.

> Expose per partition Kafka lag metric in Flink Kafka connector
> --------------------------------------------------------------
>
>                 Key: FLINK-11912
>                 URL: https://issues.apache.org/jira/browse/FLINK-11912
>             Project: Flink
>          Issue Type: New Feature
>          Components: Connectors / Kafka
>    Affects Versions: 1.6.4, 1.7.2
>            Reporter: Shuyi Chen
>            Assignee: Shuyi Chen
>            Priority: Major
>
> In production, it's important that we expose the Kafka lag by partition 
> metric in order for users to diagnose which Kafka partition is lagging. 
> However, although the Kafka lag by partition metrics are available in 
> KafkaConsumer after 0.10.2,  Flink was not able to properly register it 
> because the metrics are only available after the consumer start polling data 
> from partitions. I would suggest the following fix:
> 1) In KafkaConsumerThread.run(), allocate a manualRegisteredMetricSet.
> 2) in the fetch loop, as KafkaConsumer discovers new partitions, manually add 
> MetricName for those partitions that we want to register into 
> manualRegisteredMetricSet. 
> 3) in the fetch loop, check if manualRegisteredMetricSet is empty. If not, 
> try to search for the metrics available in KafkaConsumer, and if found, 
> register it and remove the entry from manualRegisteredMetricSet. 
> The overhead of the above approach is bounded and only incur when discovering 
> new partitions, and registration is done once the KafkaConsumer have the 
> metrics exposed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (FLINK-11912) Expose per partition Kafka lag metric in Flink Kafka connector

Reply via email to