asafm commented on code in PR #21488:
URL: https://github.com/apache/pulsar/pull/21488#discussion_r1384747466


##########
pip/pip-314.md:
##########
@@ -0,0 +1,57 @@
+# PIP-314: Add metrics pulsar_subscription_redelivery_messages
+
+# Background knowledge
+
+## Delivery of messages in the normal case
+
+To simplify the description of the mechanism, let's take the policy [Auto 
Split Hash 
Range](https://pulsar.apache.org/docs/3.0.x/concepts-messaging/#auto-split-hash-range)
  as an example:
+
+| `0 ~ 16,384`      | `16,385 ~ 32,768` | `32,769 ~ 65,536`              |
+|-------------------|-------------------|--------------------------------|
+| ------- C1 ------ | ------- C2 ------ | ------------- C3 ------------- |
+
+- If the entry key is in the range `-1 (exclusive) ~ 16,384 (inclusive)`, it is delivered to C1
+- If the entry key is in the range `16,384 (exclusive) ~ 32,768 (inclusive)`, it is delivered to C2
+- If the entry key is in the range `32,768 (exclusive) ~ 65,536 (inclusive)`, it is delivered to C3
+
+# Motivation
+
+For the example above, if `C1` is stuck or consuming slowly, the Broker pushes the entries that should be delivered to `C1` into an in-memory collection `redelivery_messages` and continues reading the next entries. The collection `redelivery_messages` then grows larger and larger and takes up a lot of memory. When dispatching messages, the Broker also has to check the keys of the entries in `redelivery_messages`, which affects performance.
+
+# Goals
+- Add metrics
+  - Broker level:
+    - Add a metric `pulsar_broker_max_subscription_redelivery_messages_total` to indicate the largest `redelivery_messages` count across the subscriptions in the broker, which can be used to push an alert when it grows too large. Note: the Broker will also print a log containing the name of the subscription that has the maximum count of redelivery messages, which helps locate the problematic subscription.
+    - Add a metric `pulsar_broker_memory_usage_of_redelivery_messages_bytes` 
to indicate the memory usage of all `redelivery_messages` in the broker. This 
is helpful for memory health checks.
+- Improve `Topic stats`.
+  - Add an attribute `redeliveryMessageCount` under `SubscriptionStats`
+
+The difference between `redelivery_messages` and `pulsar_subscription_unacked_messages` / `pulsar_subscription_back_log`:
+
+- `pulsar_subscription_unacked_messages`: messages that have been delivered to the client but have not been acknowledged yet.
+- `pulsar_subscription_back_log`: how many messages still need to be acknowledged; this includes both messages that have already been delivered and messages that have yet to be delivered.
+
+### Public API
+
+<strong>SubscriptionStats.java</strong>
+```java
+long getRedeliveryMessageCount();
+```
+
+### Metrics
+
+**pulsar_broker_max_subscription_redelivery_messages_total**
+- Description: the maximum `redelivery_messages` count maintained by any subscription in the broker.
+- Attributes: `[cluster]`
+- Unit: `Counter`
+
+**pulsar_broker_memory_usage_of_redelivery_messages_bytes**
+- Description: the memory usage of all `redelivery_messages` in the broker.

Review Comment:
   Ok, I have given it some thought. Here's what I think:
   
   I'm going to separate the solution into long and short term.
   
   ## Long term
   With each feature added you're bound to have more metrics you'll need to add, which might be at topic or subscription level (granularity).
   Stating that you can't because we have too many doesn't make a lot of sense, monitoring-wise.
   Working around it by introducing the alert logic into Pulsar, by printing or exporting the topK or everything greater than Y (a threshold), also doesn't seem plausible, since:
   a. You'll have to do this for every metric - so you're going to need a framework for it.
   b. I think it's a bit of a convoluted architecture to introduce alerting logic inside Pulsar.
   
   The core problem is that we export too many metrics, right?
   
   I think the solution lies within PIP-264.
   Allowing you to filter - decide which metrics you export - minimizes the number of metrics.
   Allowing you to have the metric emitted at group granularity lets you avoid exporting it at topic level, and only when needed - alert-based - open up those metrics.
   
   In summary, I think PIP-264 solves this problem.
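   
   To make the filtering/group-granularity point a bit more concrete, here's a minimal sketch of the kind of view configuration the OpenTelemetry Java SDK already offers, assuming PIP-264 ends up exposing these metrics through that SDK. The instrument and attribute names below are illustrative only, not the actual PIP-314 names:
   
   ```java
   import io.opentelemetry.sdk.metrics.Aggregation;
   import io.opentelemetry.sdk.metrics.InstrumentSelector;
   import io.opentelemetry.sdk.metrics.SdkMeterProvider;
   import io.opentelemetry.sdk.metrics.View;
   
   import java.util.Set;
   
   public class RedeliveryMetricViews {
       public static void main(String[] args) {
           SdkMeterProvider meterProvider = SdkMeterProvider.builder()
               // Filtering: drop the per-subscription instrument entirely when that
               // granularity is not needed (hypothetical instrument name).
               .registerView(
                   InstrumentSelector.builder()
                       .setName("pulsar.subscription.redelivery.messages")
                       .build(),
                   View.builder()
                       .setAggregation(Aggregation.drop())
                       .build())
               // Group granularity: keep a redelivery instrument but retain only the
               // namespace attribute, so topic/subscription series are merged into one
               // namespace-level series (attribute name assumed).
               .registerView(
                   InstrumentSelector.builder()
                       .setName("pulsar.broker.redelivery.messages")
                       .build(),
                   View.builder()
                       .setAttributeFilter(Set.of("pulsar.namespace"))
                       .build())
               .build();
           // meterProvider would then back the broker's OpenTelemetry meters.
       }
   }
   ```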
   
   ## Short term
   Still, for now, we need a way to get alerted when topic A crosses threshold X of a certain metric.
   If you have too many topics, you are probably exporting metrics at namespace granularity, hence you can set up the alert on the namespace metric.
   Yes, it has the disadvantage of specifying one threshold for all topics - but you'll have this problem also if you embed the threshold logic inside the broker/proxy. Yeah, I know you'll say topK solves it. The problem with topK is that it's very confusing to the user: the metric for a certain topic keeps dipping in and out - just imagine the graph - super confusing. If you emit the topK as a log, I think it's ok, but it will be super noisy compared to a threshold, since you always have a topK, so you always print it. With a threshold, you can at least print the topic only once the threshold is crossed.
   
   Ok, now that you know a certain namespace is in trouble, you need to know which topics. For that I was thinking of several ideas which are not finalized, so I'm just bringing them up:
   * We can include this metric in the topic stats endpoint. If we can fetch topic stats per namespace, we can run the HTTP request, get it, and even improve it by specifying in the query params which fields (JSONPath) you're actually interested in, to make the result small. When the alert fires, you can trigger a script which queries it, finds the topics that crossed the threshold, and then creates matching alerts in PD for each one - see the sketch below.
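   
   As a rough sketch of that script (not part of this PIP), using the Java admin client: the namespace, service URL and threshold below are placeholders, and `getRedeliveryMessageCount()` is the accessor this PIP proposes to add to `SubscriptionStats`:
   
   ```java
   import org.apache.pulsar.client.admin.PulsarAdmin;
   import org.apache.pulsar.common.policies.data.TopicStats;
   
   public class FindRedeliveryHotspots {
       public static void main(String[] args) throws Exception {
           String namespace = "public/default";   // the namespace that fired the alert
           long threshold = 100_000;              // example per-subscription threshold
   
           try (PulsarAdmin admin = PulsarAdmin.builder()
                   .serviceHttpUrl("http://localhost:8080")
                   .build()) {
               for (String topic : admin.topics().getList(namespace)) {
                   TopicStats stats = admin.topics().getStats(topic);
                   stats.getSubscriptions().forEach((subName, subStats) -> {
                       long redelivery = subStats.getRedeliveryMessageCount();
                       if (redelivery > threshold) {
                           // Here the script would create the matching alert
                           // (e.g. in PD) for this topic/subscription pair.
                           System.out.printf("ALERT %s / %s: %d redelivery messages%n",
                                   topic, subName, redelivery);
                       }
                   });
               }
           }
       }
   }
   ```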
   
   ## Previous idea discussion
   * Embedding the threshold logic and printing to the log whenever the threshold is crossed.
     * This means we'd need to do this as a framework and not reinvent the wheel for each metric.
     * Again, I'm not a big fan, because this is alerting logic and it shouldn't be a Pulsar feature.
   * Emitting this metric only for the topK topics
     * The topK can change very often, hence the graph will look sparse and cut up ==> very confusing for the user. Again, this is alerting logic, so it also doesn't make a lot of sense to embed it inside Pulsar.
       


