asafm commented on code in PR #21488:
URL: https://github.com/apache/pulsar/pull/21488#discussion_r1383316048


##########
pip/pip-314.md:
##########
@@ -0,0 +1,57 @@
+# PIP-314: Add metrics pulsar_subscription_redelivery_messages
+
+# Background knowledge
+
+## Delivery of messages in normal
+
+To simplify the description of the mechanism, let's take the policy [Auto 
Split Hash 
Range](https://pulsar.apache.org/docs/3.0.x/concepts-messaging/#auto-split-hash-range)
  as an example:
+
+| `0 ~ 16,384`      | `16,385 ~ 32,768` | `32,769 ~ 65,536`              |
+|-------------------|-------------------|--------------------------------|
+| ------- C1 ------ | ------- C2 ------ | ------------- C3 ------------- |
+
+- If the entry key is between `-1(non-include) ~ 16,384(include)`, it is 
delivered to C1
+- If the entry key is between `16,384(non-include) ~ 32,768(include)`, it is 
delivered to C2
+- If the entry key is between `32,768(non-include) ~ 65,536(include)`, it is 
delivered to C3
+
+# Motivation
+
+For the example above, if `C1` is stuck or consumed slowly, the Broker will 
push the entries that should be delivered to `C1` into a memory collection 
`redelivery_messages` and read next entries continue, then the collection 
`redelivery_messages` becomes larger and larger and take up a lot of memory. 
When sending messages, it will also determine the key of the entries in the 
collection `redelivery_messages`, affecting performance.

Review Comment:
   When you write "V1 is stuck" you mean the internal queue of C1 is full? When 
that happens, messages are continued to be read from the topic, but messages 
whose keys belongs to c1 will be placed in a buffer called Redelivery Mesages? 
   Is this buffer one for all "stuck" consumers or each consumer has it?



##########
pip/pip-314.md:
##########
@@ -0,0 +1,57 @@
+# PIP-314: Add metrics pulsar_subscription_redelivery_messages
+
+# Background knowledge
+
+## Delivery of messages in normal
+
+To simplify the description of the mechanism, let's take the policy [Auto 
Split Hash 
Range](https://pulsar.apache.org/docs/3.0.x/concepts-messaging/#auto-split-hash-range)
  as an example:

Review Comment:
   This described Key Shared. Is Redelivery or messages only relevant for Key 
Shared subscriptions? If so, I would state that in the background.



##########
pip/pip-314.md:
##########
@@ -0,0 +1,57 @@
+# PIP-314: Add metrics pulsar_subscription_redelivery_messages
+
+# Background knowledge
+
+## Delivery of messages in normal
+
+To simplify the description of the mechanism, let's take the policy [Auto 
Split Hash 
Range](https://pulsar.apache.org/docs/3.0.x/concepts-messaging/#auto-split-hash-range)
  as an example:
+
+| `0 ~ 16,384`      | `16,385 ~ 32,768` | `32,769 ~ 65,536`              |
+|-------------------|-------------------|--------------------------------|
+| ------- C1 ------ | ------- C2 ------ | ------------- C3 ------------- |
+
+- If the entry key is between `-1(non-include) ~ 16,384(include)`, it is 
delivered to C1
+- If the entry key is between `16,384(non-include) ~ 32,768(include)`, it is 
delivered to C2
+- If the entry key is between `32,768(non-include) ~ 65,536(include)`, it is 
delivered to C3
+
+# Motivation
+
+For the example above, if `C1` is stuck or consumed slowly, the Broker will 
push the entries that should be delivered to `C1` into a memory collection 
`redelivery_messages` and read next entries continue, then the collection 
`redelivery_messages` becomes larger and larger and take up a lot of memory. 
When sending messages, it will also determine the key of the entries in the 
collection `redelivery_messages`, affecting performance.
+
+# Goals
+- Add metrics
+  - Broker level:
+    - Add a metric `pulsar_broker_max_subscription_redelivery_messages_total` 
to indicate the max one `{redelivery_messages}` of subscriptions in the broker, 
used by pushing an alert if it is too large. Nit: The Broker will print a log, 
which contains the name of the subscription(which has the maximum count of 
redelivery messages). This will help find the issue subscription.
+    - Add a metric `pulsar_broker_memory_usage_of_redelivery_messages_bytes` 
to indicate the memory usage of all `redelivery_messages` in the broker. This 
is helpful for memory health checks.
+- Improve `Topic stats`.
+  - Add an attribute `redeliveryMessageCount` under `SubscriptionStats`
+
+Differ between `redelivery_messages` and `pulsar_subscription_unacked_messages 
& pulsar_subscription_back_log`
+
+- `pulsar_subscription_unacked_messages`: the messages have been delivered to 
the client but have not been acknowledged yet.
+- `pulsar_subscription_back_log`: how many messages should be acknowledged, 
contains delivered messages, and the messages which should be delivered.
+
+### Public API
+
+<strong>SubscriptionStats.java</strong>
+```java
+long getRedeliveryMessageCount();
+```
+
+### Metrics
+
+**pulsar_broker_max_subscription_redelivery_messages_total**
+- Description: the max one of 
`{pulsar_broker_memory_usage_of_redelivery_messages_bytes}` maintains by 
subscriptions in the broker.
+- Attributes: `[cluster]`
+- Unit: `Counter`
+
+**pulsar_broker_memory_usage_of_redelivery_messages_bytes**
+- Description: the memory usage of all `redelivery_messages` in the broker.

Review Comment:
   Let's take a similar problem in MySQL or Elasticsearch. You have so many 
queries entering the system, and you want to know the ones the are the slowest. 
You can enable slow-query log for example and it will log the slowest queries 
(above certain threshold). 
   
   We can decide to have an histogram for buffer size, with 0 buckets 
(basically count and max) on a namespace level. Users can be alerted when an 
unknown topic in a namespace have max of redelivery buffer size greater than 
certain threshold. If you want to query which  was that we can select 2 
mechanisms:
   1. Expose it via an HTTP endpoint. Maybe through something we don't have 
(like namespaceStats)? 
   2. Log it - Here we it means the user needs to define a threshold for 
logging, and we only log the subscription / topic name once we cross the 
threshold - once. 
   
   @codelipenghui Your suggestion is to also define threshold but expose the 
topics/subscriptions via Prometheus metrics and not logs, right?



##########
pip/pip-314.md:
##########
@@ -0,0 +1,57 @@
+# PIP-314: Add metrics pulsar_subscription_redelivery_messages
+
+# Background knowledge
+
+## Delivery of messages in normal
+
+To simplify the description of the mechanism, let's take the policy [Auto 
Split Hash 
Range](https://pulsar.apache.org/docs/3.0.x/concepts-messaging/#auto-split-hash-range)
  as an example:
+
+| `0 ~ 16,384`      | `16,385 ~ 32,768` | `32,769 ~ 65,536`              |
+|-------------------|-------------------|--------------------------------|
+| ------- C1 ------ | ------- C2 ------ | ------------- C3 ------------- |
+
+- If the entry key is between `-1(non-include) ~ 16,384(include)`, it is 
delivered to C1
+- If the entry key is between `16,384(non-include) ~ 32,768(include)`, it is 
delivered to C2
+- If the entry key is between `32,768(non-include) ~ 65,536(include)`, it is 
delivered to C3
+
+# Motivation
+
+For the example above, if `C1` is stuck or consumed slowly, the Broker will 
push the entries that should be delivered to `C1` into a memory collection 
`redelivery_messages` and read next entries continue, then the collection 
`redelivery_messages` becomes larger and larger and take up a lot of memory. 
When sending messages, it will also determine the key of the entries in the 
collection `redelivery_messages`, affecting performance.
+
+# Goals
+- Add metrics
+  - Broker level:
+    - Add a metric `pulsar_broker_max_subscription_redelivery_messages_total` 
to indicate the max one `{redelivery_messages}` of subscriptions in the broker, 
used by pushing an alert if it is too large. Nit: The Broker will print a log, 
which contains the name of the subscription(which has the maximum count of 
redelivery messages). This will help find the issue subscription.

Review Comment:
   This name is super confusing: Max of a (single) subscription of redelivery 
messages , then total?
   
   I'll add suggestions soon
   



##########
pip/pip-314.md:
##########
@@ -0,0 +1,57 @@
+# PIP-314: Add metrics pulsar_subscription_redelivery_messages
+
+# Background knowledge
+
+## Delivery of messages in normal
+
+To simplify the description of the mechanism, let's take the policy [Auto 
Split Hash 
Range](https://pulsar.apache.org/docs/3.0.x/concepts-messaging/#auto-split-hash-range)
  as an example:
+
+| `0 ~ 16,384`      | `16,385 ~ 32,768` | `32,769 ~ 65,536`              |
+|-------------------|-------------------|--------------------------------|
+| ------- C1 ------ | ------- C2 ------ | ------------- C3 ------------- |
+
+- If the entry key is between `-1(non-include) ~ 16,384(include)`, it is 
delivered to C1
+- If the entry key is between `16,384(non-include) ~ 32,768(include)`, it is 
delivered to C2
+- If the entry key is between `32,768(non-include) ~ 65,536(include)`, it is 
delivered to C3
+
+# Motivation
+
+For the example above, if `C1` is stuck or consumed slowly, the Broker will 
push the entries that should be delivered to `C1` into a memory collection 
`redelivery_messages` and read next entries continue, then the collection 
`redelivery_messages` becomes larger and larger and take up a lot of memory. 
When sending messages, it will also determine the key of the entries in the 
collection `redelivery_messages`, affecting performance.

Review Comment:
   What do you mean by "determine" the key ? Also, why doing that would ruin 
performance?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to