codelipenghui commented on code in PR #21488:
URL: https://github.com/apache/pulsar/pull/21488#discussion_r1383505861


##########
pip/pip-314.md:
##########
@@ -0,0 +1,57 @@
+# PIP-314: Add metrics pulsar_subscription_redelivery_messages
+
+# Background knowledge
+
+## Delivery of messages in the normal case
+
+To simplify the description of the mechanism, let's take the policy [Auto Split Hash Range](https://pulsar.apache.org/docs/3.0.x/concepts-messaging/#auto-split-hash-range) as an example:
+
+| `0 ~ 16,384` | `16,385 ~ 32,768` | `32,769 ~ 65,536` |
+|--------------|-------------------|-------------------|
+| C1           | C2                | C3                |
+
+- If the entry's key hash is between `-1` (exclusive) and `16,384` (inclusive), the entry is delivered to C1
+- If the entry's key hash is between `16,384` (exclusive) and `32,768` (inclusive), the entry is delivered to C2
+- If the entry's key hash is between `32,768` (exclusive) and `65,536` (inclusive), the entry is delivered to C3 (see the sketch below)
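The selection can be pictured with a small, self-contained sketch. It is illustrative only: the class and method names are invented, it uses a plain `hashCode`-based hash, and it is not Pulsar's actual Key_Shared implementation.

```java
import java.util.TreeMap;

// Illustrative hash-range routing; names are hypothetical, not Pulsar's classes.
public class HashRangeRoutingSketch {
    private static final int RANGE_SIZE = 65_536; // total hash space used in the example above

    // upper bound of each consumer's range (inclusive) -> consumer name
    private final TreeMap<Integer, String> ranges = new TreeMap<>();

    public HashRangeRoutingSketch() {
        ranges.put(16_384, "C1"); // (-1, 16,384]
        ranges.put(32_768, "C2"); // (16,384, 32,768]
        ranges.put(65_536, "C3"); // (32,768, 65,536]
    }

    public String selectConsumer(String key) {
        // Map the key into 1..65,536; Pulsar uses its own hash function internally.
        int hash = Math.floorMod(key.hashCode(), RANGE_SIZE) + 1;
        // The consumer owning the smallest range upper bound >= hash receives the entry.
        return ranges.ceilingEntry(hash).getValue();
    }

    public static void main(String[] args) {
        HashRangeRoutingSketch router = new HashRangeRoutingSketch();
        System.out.println("order-123 -> " + router.selectConsumer("order-123"));
    }
}
```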
+
+# Motivation
+
+For the example above, if `C1` is stuck or consuming slowly, the Broker pushes the entries that should be delivered to `C1` into an in-memory collection, `redelivery_messages`, and continues reading the next entries. The collection `redelivery_messages` then grows larger and larger and takes up a lot of memory. When dispatching messages, the Broker also has to check the keys of the entries in `redelivery_messages`, which affects performance.
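A toy model of this accumulation is shown below. It is purely illustrative: the types and method names are hypothetical and do not match Pulsar's dispatcher classes.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of how entries for a stuck consumer pile up in memory
// while the broker keeps reading ahead for the healthy consumers.
public class RedeliveryAccumulationSketch {
    // Hypothetical stand-in for a Pulsar entry position plus its key hash.
    record Entry(long ledgerId, long entryId, int keyHash) {}

    // The in-memory collection the PIP refers to as `redelivery_messages`.
    private final Set<Entry> redeliveryMessages = new HashSet<>();

    void dispatch(List<Entry> readEntries, int stuckRangeUpperBound) {
        for (Entry entry : readEntries) {
            if (entry.keyHash() <= stuckRangeUpperBound) {
                // The owning consumer (e.g. C1) is stuck: park the entry in memory
                // and keep going instead of blocking the whole read loop.
                redeliveryMessages.add(entry);
            } else {
                // Deliver to the healthy consumer that owns this hash range (omitted).
            }
        }
    }

    // The per-subscription count the PIP proposes to expose as `redeliveryMessageCount`.
    long redeliveryMessageCount() {
        return redeliveryMessages.size();
    }
}
```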
+
+# Goals
+- Add metrics
+  - Broker level:
+    - Add a metric `pulsar_broker_max_subscription_redelivery_messages_total` to indicate the largest `redelivery_messages` count held by any subscription in the broker, which can be used to push an alert if it grows too large. Note: the Broker will also print a log containing the name of the subscription that has the maximum count of redelivery messages, which helps locate the problematic subscription (see the sketch after this list).
+    - Add a metric `pulsar_broker_memory_usage_of_redelivery_messages_bytes` to indicate the memory usage of all `redelivery_messages` in the broker. This is helpful for memory health checks.
+- Improve `Topic stats`.
+  - Add an attribute `redeliveryMessageCount` under `SubscriptionStats`
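The broker-level value and the accompanying log could be derived from the per-subscription counts roughly as sketched below. This is a hypothetical aggregation loop, assuming a `Map` of subscription names to their redelivery counts; it is not the actual PR code.

```java
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical aggregation of per-subscription redelivery counts into the
// broker-level maximum, plus the log line mentioned above.
public class RedeliveryMetricsAggregatorSketch {
    private static final Logger log = LoggerFactory.getLogger(RedeliveryMetricsAggregatorSketch.class);

    /** @param redeliveryCounts subscription name -> redelivery_messages count */
    public long maxSubscriptionRedeliveryMessages(Map<String, Long> redeliveryCounts) {
        long max = 0;
        String maxSubscription = null;
        for (Map.Entry<String, Long> e : redeliveryCounts.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                maxSubscription = e.getKey();
            }
        }
        if (maxSubscription != null) {
            // Helps operators find the problematic subscription from the broker logs.
            log.info("Subscription {} has the maximum redelivery message count: {}", maxSubscription, max);
        }
        return max; // exposed as pulsar_broker_max_subscription_redelivery_messages_total
    }
}
```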
+
+Differences between `redelivery_messages` and `pulsar_subscription_unacked_messages` / `pulsar_subscription_back_log`:
+
+- `pulsar_subscription_unacked_messages`: messages that have been delivered to the client but have not been acknowledged yet.
+- `pulsar_subscription_back_log`: how many messages still need to be acknowledged; this includes both delivered messages and messages that have yet to be delivered.
+
+### Public API
+
+<strong>SubscriptionStats.java</strong>
+```java
+long getRedeliveryMessageCount();
+```
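
For illustration, the new attribute would be readable through the existing topic-stats path of the Java admin client roughly as shown below. This is a sketch assuming the PIP is merged; the service URL and topic name are placeholders.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TopicStats;

public class RedeliveryStatsExample {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // placeholder broker URL
                .build()) {
            TopicStats stats = admin.topics().getStats("persistent://public/default/my-topic");
            stats.getSubscriptions().forEach((name, subStats) ->
                    // getRedeliveryMessageCount() is the method proposed by this PIP.
                    System.out.println(name + " redeliveryMessageCount=" + subStats.getRedeliveryMessageCount()));
        }
    }
}
```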
+
+### Metrics
+
+**pulsar_broker_max_subscription_redelivery_messages_total**
+- Description: the largest `redelivery_messages` count maintained by any single subscription in the broker.
+- Attributes: `[cluster]`
+- Unit: `Counter`
+
+**pulsar_broker_memory_usage_of_redelivery_messages_bytes**
+- Description: the memory usage of all `redelivery_messages` in the broker.

Review Comment:
   Yes, Pulsar transactions also have a slow-transactions endpoint to query all the slow transactions.
   I think the main concern is the integration with alert systems?
   
   Let's take backlogs as an example.
   
   If we only have a broker-level backlog metric, we can set a backlog limit for each broker. But the limit is not easy to set because it depends on the topics. Say we set it to 100k, but the broker hosts 10k topics and each topic only has 10 backlog messages; that already adds up to 100k even though no single topic is actually a problem.
   
   But if we have topic-level metrics only for the top 100 topics with the largest backlogs, we can set the per-topic backlog threshold to 10k, so that we can detect at most 100 topics with backlog issues.
   
   A REST API can also be integrated with alert systems, but it's hard to check historical data.
   
   A log-based solution will work for this case; you can get historical data, but it's not easy to see the trend of the backlogs; we would have to set up another dashboard (e.g., Kibana) on top of the logs. The trend is also important when troubleshooting problems: if the bottleneck is on the consumer side, we should see things get better after consumers scale up.
   
   Sorry, I think I provided a wrong example before. Latency is not a good case; Counter and Gauge should be good cases.
   
   I will continue to think about the essential differences between different 
solutions.


