[
https://issues.apache.org/jira/browse/KAFKA-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16348232#comment-16348232
]
James Cheng commented on KAFKA-6505:
------------------------------------
[~steff1193]: Yes, a KIP is still required even if you are only adding new
metrics.
The reason for this is that the metrics are a "monitoring interface" that users
(operators?) will rely on, and that needs to be supported long-term. So the KIP
is the place where we have the discussions about use cases, naming, etc.
> Add simple raw "offset-commit-failures", "offset-commits" and
> "offset-commit-successes" count metric
> ----------------------------------------------------------------------------------------------------
>
> Key: KAFKA-6505
> URL: https://issues.apache.org/jira/browse/KAFKA-6505
> Project: Kafka
> Issue Type: Improvement
> Components: KafkaConnect
> Affects Versions: 1.0.0
> Reporter: Per Steffensen
> Priority: Minor
> Labels: needs-kip
>
> MBean
> "kafka.connect:type=connector-task-metrics,connector=<connector-name>,task=x"
> has several attributes. Most of them seem to be avg/max/pct over the entire
> lifetime of the process. They are not very useful when monitoring a system,
> where you typically want to see when there have been problems and if there
> are problems right now.
> E.g. I would like to expose to an administrator when offset-commits have been
> failing (e.g. timing out) including if they are failing right now. It is
> really hard to do that properly, just using attribute
> "offset-commit-failure-percentage". You can expose a number telling how much
> the percentage has changed between two consecutive polls of the metric - if
> it changed to the positive side, we saw offset-commit failures, and if it
> changed to the negative side (or is stable at 0) we saw offset-commit success
> - at least as long as the system has not been running for so long that a
> single failing offset-commit does not even change the percentage. But it is
> really odd to do it this way.
> *I would like to see an attribute "offset-commit-failures" that simply counts
> how many offset-commits have failed, as an ever-increasing number. Maybe also
> attributes "offset-commits" and "offset-commit-successes". Then I can do a
> delta between the two last metric-polls to show how many
> offset-commit-attempts have failed "very recently". Let this ticket be about
> that particular added attribute (or the three added attributes).*
> Just a note on metrics IMHO (should probably be posted somewhere else):
> In general consider getting rid of stuff like avg, max, pct over the entire
> lifetime of the process - current state is what interests people, especially
> when it comes to failure-related metrics (failure-pct over the lifetime of
> the process is not very useful). And people will continuously be polling and
> storing the metrics, so we will have a history of "current state" somewhere
> else (e.g. in Prometheus). Just give us the raw counts. Modern monitoring
> tools can do all the avg, max, pct for you based on a time-series of
> metrics-poll-results - and they can do it for periods of your choice (e.g.
> average over the last minute or 5 minutes) - have a look at Prometheus PromQL
> (e.g. used through Grafana). Just expose the raw number and let the
> average/max/min/pct calculation be done on the collect/presentation side.
> Only do "advanced" stuff for cases that are very interesting and where it
> cannot be done based on simple raw numbers (e.g. percentiles), and consider
> whether doing it for fairly short intervals is better than for the entire
> lifetime of the process.
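
To make the "delta between the two last metric-polls" idea from the quoted
description concrete, here is a minimal sketch of what a collector could do
with raw, ever-increasing counters. It is an illustration only: the function
names and the sample poll values are hypothetical, and the counters
"offset-commit-failures" and "offset-commits" are the attributes proposed in
this ticket, not ones that are known to exist. A counter that goes backwards is
assumed to mean the process restarted and the count began again at 0.

```python
from typing import List, Optional, Tuple


def counter_delta(previous: int, current: int) -> int:
    """Increase of an ever-increasing counter between two polls.

    Assumption: if the value went down, the process restarted and the
    counter started again at 0, so the current value is the increase.
    """
    if current < previous:
        return current
    return current - previous


def windowed_failure_pct(
    polls: List[Tuple[int, int]], window: int
) -> Optional[float]:
    """Failure percentage over the last `window` poll intervals.

    `polls` holds cumulative (failures, commits) samples, oldest first.
    Returns None when no commits were attempted in the window.
    """
    recent = polls[-(window + 1):]
    failed = 0
    attempted = 0
    # Sum the per-interval increases so a restart inside the window
    # is handled interval by interval.
    for (f0, c0), (f1, c1) in zip(recent, recent[1:]):
        failed += counter_delta(f0, f1)
        attempted += counter_delta(c0, c1)
    if attempted == 0:
        return None
    return 100.0 * failed / attempted


# Hypothetical samples: 3 of 10 commits failed between the 2nd and 3rd poll.
polls = [(0, 1000), (0, 1010), (3, 1020), (3, 1030)]

lifetime_pct = 100.0 * polls[-1][0] / polls[-1][1]  # well below 1 percent
last_interval_pct = windowed_failure_pct(polls, 1)  # most recent interval
last_two_pct = windowed_failure_pct(polls, 2)       # includes the failures
```

With these made-up samples the lifetime percentage stays below 1 percent even
though 3 of 10 commits failed in one interval, while the windowed value over
the last two intervals makes the failures clearly visible. In practice a
monitoring tool would do this calculation on the collection side.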