lhotari commented on code in PR #23945:
URL: https://github.com/apache/pulsar/pull/23945#discussion_r1993328954
##########
pip/pip-406.md:
##########
@@ -0,0 +1,152 @@
+# PIP-406: Introduce metrics related to dispatch_throttled_count
+
+# Background knowledge
+
+## Motivation
+
+Currently, users can monitor subscription backlogs using the `pulsar_subscription_back_log_no_delayed` metric.
+However, if [dispatch throttling](https://pulsar.apache.org/docs/next/concepts-throttling/) is configured at the broker/topic/subscription level,
+this metric may not accurately reflect whether the backlog is due to insufficient consumer capacity, as it could also be caused by dispatch throttling.
+
+## Goals
+
+Introduce metrics that count how many times dispatching was throttled (messages/bytes) by the **broker/topic/subscription** level rate limits.
+This allows users to write PromQL queries that identify subscriptions with high backlogs but little or no throttling, pinpointing backlogs caused by insufficient consumer capacity.
+
+## In Scope
+
+Broker level:
+- Introduce the metric `pulsar_broker_dispatch_throttled_msg_count` to represent the total count of messages throttled for a broker.
+- Introduce the metric `pulsar_broker_dispatch_throttled_bytes_count` to represent the total count of bytes throttled for a broker.
+
+Topic level:
+- Introduce the metric `pulsar_dispatch_throttled_msg_count` to represent the total count of messages throttled for a topic.
+- Introduce the metric `pulsar_dispatch_throttled_bytes_count` to represent the total count of bytes throttled for a topic.
+
+Subscription level:
+- Introduce the metric `pulsar_subscription_dispatch_throttled_msg_count` to represent the total count of messages throttled for a subscription.
+- Introduce the metric `pulsar_subscription_dispatch_throttled_bytes_count` to represent the total count of bytes throttled for a subscription.
+
+## Out of Scope
+
+- These counters are not persistent and will reset upon broker restart, topic re-load, or subscription reconnection.
+
+# High Level Design
+
+1. Maintain `dispatchThrottleMsgCount` and `dispatchThrottleBytesCount` in `DispatchRateLimiter`. Increase these values in the `consumeDispatchQuota` method when the token bucket for messages or bytes is insufficient.
+2. Output these fields when retrieving metrics.
+
+# Detailed Design
+
+## Design & Implementation Details
+
+1. Maintain `dispatchThrottleMsgCount` and `dispatchThrottleBytesCount` in `DispatchRateLimiter`:
+```java
+    private final LongAdder dispatchThrottleMsgCount = new LongAdder();
+    private final LongAdder dispatchThrottleBytesCount = new LongAdder();
+```
+
+2. During each [consumeDispatchQuota](https://github.com/apache/pulsar/blob/c4cff0ab3dac169c0a1418ef2f63f61604f6278e/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/DispatchRateLimiter.java#L97-L104) call,
+if the token bucket is insufficient, increase these fields accordingly:
+```diff
+ public void consumeDispatchQuota(long numberOfMessages, long byteSize) {
+     AsyncTokenBucket localDispatchRateLimiterOnMessage = dispatchRateLimiterOnMessage;
+     if (numberOfMessages > 0 && localDispatchRateLimiterOnMessage != null) {
+-        localDispatchRateLimiterOnMessage.consumeTokens(numberOfMessages);
++        if (!localDispatchRateLimiterOnMessage.consumeTokensAndCheckIfContainsTokens(numberOfMessages)) {
++            dispatchThrottleMsgCount.increment();
++        }
+     }
+     AsyncTokenBucket localDispatchRateLimiterOnByte = dispatchRateLimiterOnByte;
+     if (byteSize > 0 && localDispatchRateLimiterOnByte != null) {
+-        localDispatchRateLimiterOnByte.consumeTokens(byteSize);
++        if (!localDispatchRateLimiterOnByte.consumeTokensAndCheckIfContainsTokens(byteSize)) {
++            dispatchThrottleBytesCount.increment();
++        }
```

Review Comment:
> By doing this, we actually only need two metrics:
>
> * `pulsar_subscription_dispatch_throttled_msg_events`
>   * Labels:
>     * topic
>     * subscription
>     * `reason`: broker/topic/subscription
> * `pulsar_subscription_dispatch_throttled_bytes_events`
>   * Labels:
>     * topic
>     * subscription
>     * `reason`: broker/topic/subscription

@shibd Yes, this would be the way in this approach from the metrics perspective, when the metrics get exposed for scraping at the /metrics endpoint.

> When a user wants to view all throttling occurrences for a subscription, they just need to use a query like the one below (without the reason label).

In this case, there is no "user" in Pulsar who would run these queries from code. For metrics exposed to a metrics database such as Prometheus or VictoriaMetrics, filtering by labels is how the "user" would consume the metrics.

Within the Pulsar code base, there are these use cases to solve internally:
- exposing metrics to the /metrics endpoint
- exposing stats in topic stats at the subscription and topic level
- exposing stats in namespace stats
- exposing stats in broker stats

There are already examples of other metrics whose values get aggregated from the subscription level up to the topic and namespace levels; a similar solution would be preferred for this new metric as well. `msgBacklog` is one such metric, and there are other examples.

The difference here is that at the subscription level there would have to be six counters to hold the metrics: `dispatchThrottledMsgEventsSubscriptionLevel`, `dispatchThrottledMsgEventsTopicLevel`, `dispatchThrottledMsgEventsBrokerLevel`, `dispatchThrottledBytesEventsSubscriptionLevel`, `dispatchThrottledBytesEventsTopicLevel` and `dispatchThrottledBytesEventsBrokerLevel`. These would be needed when aggregating (rolling up) the metric values, and also for the /metrics endpoint.
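The six-counter shape suggested above (one counter per reason, aggregated upward) could be sketched roughly as follows. This is a minimal illustration only, not the actual Pulsar implementation: the class and field names are hypothetical, and the roll-up shown is the simple sum-over-subscriptions that metrics like `msgBacklog` already use.

```java
import java.util.List;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical holder for the six per-subscription throttle-event counters,
// one msg counter and one bytes counter per throttling reason (the `reason` label).
class DispatchThrottledEvents {
    final LongAdder msgEventsSubscriptionLevel = new LongAdder();
    final LongAdder msgEventsTopicLevel = new LongAdder();
    final LongAdder msgEventsBrokerLevel = new LongAdder();
    final LongAdder bytesEventsSubscriptionLevel = new LongAdder();
    final LongAdder bytesEventsTopicLevel = new LongAdder();
    final LongAdder bytesEventsBrokerLevel = new LongAdder();

    // Total across all reasons, e.g. what a query without the `reason` label would see
    long totalMsgEvents() {
        return msgEventsSubscriptionLevel.sum()
                + msgEventsTopicLevel.sum()
                + msgEventsBrokerLevel.sum();
    }

    // Roll-up: the topic-level total is the sum over the topic's subscriptions
    static long topicMsgEvents(List<DispatchThrottledEvents> subscriptions) {
        return subscriptions.stream()
                .mapToLong(DispatchThrottledEvents::totalMsgEvents)
                .sum();
    }

    public static void main(String[] args) {
        DispatchThrottledEvents sub = new DispatchThrottledEvents();
        sub.msgEventsTopicLevel.increment();   // throttled by the topic-level limiter
        sub.msgEventsBrokerLevel.increment();  // throttled by the broker-level limiter
        System.out.println(sub.totalMsgEvents()); // prints 2
    }
}
```

When exposed at /metrics, each counter would presumably become one time series of `pulsar_subscription_dispatch_throttled_msg_events` / `..._bytes_events` carrying the corresponding `reason` label value.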