apoorvmittal10 commented on code in PR #18444: URL: https://github.com/apache/kafka/pull/18444#discussion_r1909573910
########## core/src/main/java/kafka/server/share/SharePartitionManager.java: ##########

```java
@@ -693,6 +717,38 @@ private static void removeSharePartitionFromCache(
         }
     }

+    /**
+     * The handler to update the failed share fetch request metrics.
+     *
+     * @return A BiConsumer that updates the failed share fetch request metrics.
+     */
+    private BiConsumer<Collection<TopicIdPartition>, Boolean> failedShareFetchMetricsHandler() {
+        return (topicIdPartitions, allTopicPartitionsFailed) -> {
+            // Update failed share fetch request metric.
+            topicIdPartitions.forEach(topicIdPartition ->
+                brokerTopicStats.topicStats(topicIdPartition.topicPartition().topic()).failedShareFetchRequestRate().mark());
+            if (allTopicPartitionsFailed) {
```

Review Comment:
Yes, I thought about this while implementing, and I also saw that the failed-fetch metric is marked whenever the fetch for any topic fails. I was thinking in terms of how the metrics are used. Take `failedShareFetchRequestRate`: if we always mark the all-topics metric as failed whenever any single topic fetch fails, then having two metrics might not yield much value; the topic-level metric becomes more of a log that helps debug which topic's fetch failed. Marking the all-topics metric on any topic fetch failure is desirable when the complete request fails on any topic fetch failure, which seems to be the case for regular [fetch](https://github.com/apache/kafka/blob/5684fc7a2ee1a4f29cb6d69d713233ed3c297882/core/src/main/scala/kafka/server/ReplicaManager.scala#L1890). But that is not true for share-fetch. My thinking for share-fetch: if the all-topics metric is marked, the situation is critical (the complete fetch/acknowledge request failed); if a topic-level metric is marked, the operator should debug what caused that one topic's failure (for a single-topic request the two metrics yield the same result).
I also find it wrong to bump the all-topics stats once per topic-partition in a request — https://github.com/apache/kafka/blob/5684fc7a2ee1a4f29cb6d69d713233ed3c297882/core/src/main/scala/kafka/server/ReplicaManager.scala#L1453 — since that inflates the request rate reported by the overall metric, so I avoided that in the implementation as well. I know this differs from what we currently have, but I am struggling to find value in the existing implementation. I might be missing something, so I am open to suggestions and will correct things accordingly.
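The semantics argued for in this comment can be sketched as follows. This is a minimal, hypothetical illustration only — the `recordFailures` method, the counter maps, and the topic names are stand-ins, not the actual Kafka `BrokerTopicStats`/meter API. The per-topic failure meter is marked for every failed topic, while the aggregate (all-topics) meter is marked only when the complete share-fetch request failed:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FailedShareFetchMetricsSketch {
    // Hypothetical stand-ins for rate meters; the real code marks Yammer meters.
    static final Map<String, Integer> topicFailureMarks = new HashMap<>();
    static int allTopicsFailureMarks = 0;

    // Mark the per-topic failure meter for each failed topic, but mark the
    // aggregate meter only when the whole request failed — the semantics
    // proposed in the review comment above.
    static void recordFailures(List<String> failedTopics, boolean allTopicPartitionsFailed) {
        for (String topic : failedTopics) {
            topicFailureMarks.merge(topic, 1, Integer::sum);
        }
        if (allTopicPartitionsFailed) {
            allTopicsFailureMarks++;
        }
    }

    public static void main(String[] args) {
        // One topic in the request fails: only its topic-level meter moves.
        recordFailures(List.of("orders"), false);
        // The entire share-fetch request fails: topic meters and the
        // aggregate meter all move.
        recordFailures(List.of("orders", "payments"), true);

        System.out.println("orders=" + topicFailureMarks.get("orders"));
        System.out.println("payments=" + topicFailureMarks.get("payments"));
        System.out.println("allTopics=" + allTopicsFailureMarks);
    }
}
```

Under these semantics a spike in the aggregate meter always means whole requests are failing, while the per-topic meters localize which topic is responsible — the two signals stay distinguishable instead of the aggregate simply echoing the per-topic counts.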