asafm commented on code in PR #21488:
URL: https://github.com/apache/pulsar/pull/21488#discussion_r1384747466
##########
pip/pip-314.md:
##########
@@ -0,0 +1,57 @@

# PIP-314: Add metrics pulsar_subscription_redelivery_messages

# Background knowledge

## Normal delivery of messages

To simplify the description of the mechanism, let's take the policy [Auto Split Hash Range](https://pulsar.apache.org/docs/3.0.x/concepts-messaging/#auto-split-hash-range) as an example:

| `0 ~ 16,384` | `16,385 ~ 32,768` | `32,769 ~ 65,536` |
|--------------|-------------------|-------------------|
| C1           | C2                | C3                |

- If the entry key is between `-1` (exclusive) and `16,384` (inclusive), it is delivered to C1.
- If the entry key is between `16,384` (exclusive) and `32,768` (inclusive), it is delivered to C2.
- If the entry key is between `32,768` (exclusive) and `65,536` (inclusive), it is delivered to C3.

# Motivation

In the example above, if `C1` is stuck or consuming slowly, the broker pushes the entries that should be delivered to `C1` into an in-memory collection, `redelivery_messages`, and continues reading the next entries. The collection `redelivery_messages` then grows larger and larger and takes up a lot of memory. When sending messages, the broker also has to check the keys of the entries in `redelivery_messages`, which affects performance.

# Goals

- Add metrics
  - Broker level:
    - Add a metric `pulsar_broker_max_subscription_redelivery_messages_total` to indicate the largest `redelivery_messages` count among the subscriptions in the broker, which can be used to push an alert when it grows too large. Note: the broker will also print a log containing the name of the subscription with the maximum count of redelivery messages, which helps identify the problematic subscription.
    - Add a metric `pulsar_broker_memory_usage_of_redelivery_messages_bytes` to indicate the memory usage of all `redelivery_messages` in the broker. This is helpful for memory health checks.
- Improve `Topic stats`.
  - Add an attribute `redeliveryMessageCount` under `SubscriptionStats`.

Difference between `redelivery_messages` and `pulsar_subscription_unacked_messages` / `pulsar_subscription_back_log`:

- `pulsar_subscription_unacked_messages`: the messages that have been delivered to the client but have not been acknowledged yet.
- `pulsar_subscription_back_log`: how many messages still need to be acknowledged; this includes both the messages already delivered and the messages yet to be delivered.

### Public API

**SubscriptionStats.java**
```java
long getRedeliveryMessageCount();
```

### Metrics

**pulsar_broker_max_subscription_redelivery_messages_total**
- Description: the largest `redelivery_messages` count maintained by any subscription in the broker.
- Attributes: `[cluster]`
- Unit: `Counter`

**pulsar_broker_memory_usage_of_redelivery_messages_bytes**
- Description: the memory usage of all `redelivery_messages` in the broker.
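To make the hash-range routing in the background section above concrete, here is a minimal, self-contained Java sketch. It is only a toy illustration using the boundary values from the example table, not Pulsar's actual sticky-key consumer selector.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class HashRangeRoutingExample {

    public static void main(String[] args) {
        // Map each range's upper bound (inclusive) to its consumer,
        // using the boundaries from the example above.
        NavigableMap<Integer, String> upperBoundToConsumer = new TreeMap<>();
        upperBoundToConsumer.put(16_384, "C1"); // (-1, 16_384]     -> C1
        upperBoundToConsumer.put(32_768, "C2"); // (16_384, 32_768] -> C2
        upperBoundToConsumer.put(65_536, "C3"); // (32_768, 65_536] -> C3

        // ceilingEntry returns the first upper bound >= the key hash,
        // i.e. the range the key falls into (hashes are assumed to stay
        // within 0..65,536 as in the example).
        int keyHash = 20_000;
        String consumer = upperBoundToConsumer.ceilingEntry(keyHash).getValue();
        System.out.println("key hash " + keyHash + " -> " + consumer); // prints C2
    }
}
```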
Review Comment:

Ok, I have given it some thought. Here's what I think: I'm going to separate the solution into long term and short term.

## Long term

With each feature added, you're bound to have more metrics you need to add, which might be at topic or subscription level (granularity). Stating that you can't because we already export too many doesn't make a lot of sense, monitoring-wise. Working around it by introducing the alerting logic into Pulsar, by printing or exporting the top-K values, or all values greater than some threshold Y, also doesn't seem plausible, since:

a. You'll have to do this for every metric, so you're going to need a framework for it.
b. I think it's a bit of a convoluted architecture to introduce alerting logic inside Pulsar.

The core problem is that we export too many metrics, right? I think the solution lies within PIP-264. Allowing you to filter, i.e. decide which metrics you export, minimizes the number of metrics. Allowing you to have the metric emitted at group granularity means you don't have to export it at the topic level; only when needed, alert-based, do you open up those metrics. In summary, I think PIP-264 solves this problem.

## Short term

Still, for now we need a way to get alerted when topic A crosses threshold X of a certain metric. If you have too many topics, you are probably exporting metrics at namespace granularity, hence you can set up the alert on the namespace metric. Yes, it has the disadvantage of a single threshold for all topics, but you'd have that problem too if you embedded the threshold logic inside the broker/proxy.

Yeah, I know you'll say top-K solves it. The problem with top-K is that it's very confusing to the user: a metric for a certain topic dipping in and out of the set, just imagine the graph, super confusing. If you emit the top-K as a log, I think it's ok, but it will be super noisy compared to a threshold, since there is always a top-K, so you always print it. With a threshold, you at least only print the topic once the threshold is crossed.

Ok, now that you know a certain namespace is in trouble, you need to know which topics. For that I was thinking of several ideas which are not finalized, so I'm just bringing up ideas:

* We can include this metric in the topic stats endpoint. If we can emit topic stats per namespace, we can run the HTTP request, get the result, and even improve it by specifying in the query params which fields (JSONPath) you're actually interested in, to keep the result small. When the alert fires, you can trigger a script which queries it, finds the topics that crossed the threshold, and then creates matching alerts in PD for each one (a rough sketch of such a script is at the end of this comment).

## Previous idea discussion

* Embedding the threshold logic and printing to the log whenever the threshold is crossed.
  * This means we need to do it as a framework and not reinvent the wheel for each metric.
  * I'm not a big fan, again because this is alerting logic and should not be a Pulsar feature.
* Emitting this metric only for the top-K topics.
  * The top-K can change very often, hence the graph will look sparse and cut ==> very confusing for the user. Again, this is alerting logic, so it also doesn't make a lot of sense to embed it inside Pulsar.
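To make the short-term idea above a bit more concrete, here is a rough sketch of such a script using the Pulsar Admin Java client. The service URL, namespace, and threshold are placeholders, and it checks the existing `msgBacklog` field per subscription since the proposed `redeliveryMessageCount` attribute does not exist yet; that attribute would be read the same way once added.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TopicStats;

public class FindTopicsOverThreshold {

    private static final long THRESHOLD = 10_000; // placeholder threshold

    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // placeholder broker URL
                .build()) {
            String namespace = "public/default"; // placeholder: the namespace the alert fired on
            for (String topic : admin.topics().getList(namespace)) {
                TopicStats stats = admin.topics().getStats(topic);
                stats.getSubscriptions().forEach((subName, subStats) -> {
                    long value = subStats.getMsgBacklog();
                    if (value > THRESHOLD) {
                        // Instead of printing, the script could create a matching
                        // alert (e.g. in PD) for this topic/subscription.
                        System.out.printf("%s / %s is over threshold: %d%n",
                                topic, subName, value);
                    }
                });
            }
        }
    }
}
```

If the stats endpoint grows field selection via query params as suggested above, the same loop could fetch far less data per topic.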
