asafm commented on code in PR #20859: URL: https://github.com/apache/pulsar/pull/20859#discussion_r1273558633
########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge Review Comment: I would like to quote the description of this section from the template: > Describes all the knowledge you need to know in order to understand all the other sections in this PIP > > * Give a high level explanation on all concepts you will be using throughout this document. For example, if you want to talk about Persistent Subscriptions, explain briefly (1 paragraph) what this is. If you're going to talk about Transaction Buffer, explain briefly what this is. > If you're going to change something specific, then go into more detail about it and how it works. > * Provide links where possible if a person wants to dig deeper into the background information. > What you wrote is the motivation - you described the problem users face today. That should be in the motivation section. How do you do that? When you finish writing your design document, go over it and list every concept you have used that the reader needs to know beforehand to understand this doc. Once you have that list, briefly describe each item. In your case it would be: * backlog * subscription * Delayed messages (which you have excluded) * and more A good example, which matches your topic perfectly, is #19601 ########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge +Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time. + +# Motivation + +The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. 
This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues. Review Comment: Let me quote the template for what should be written under motivation: > Describe the problem this proposal is trying to solve. > > * Explain what is the problem you're trying to solve - current situation. > * This section is the "Why" of your proposal. > Instead, you explained why you want to add this duration metric. The solution (the duration metric in your case) should be described in the High Level Design section, and there you explain how it solves the problem described in the motivation. You can't use the solution already in the motivation section as you did. Just focus on describing the problem at hand. Also very important: you've used too many words that repeat the same idea over and over and obscure the meaning. You can say exactly what you said in this section with half the words. ########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge +Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time. + +# Motivation + +The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. 
This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues. + +Maintaining the health of subscriptions is of paramount importance for the smooth operation of a messaging system. As message delivery involves interaction between producers and consumers, backlogs or unacknowledged messages can lead to data delays or losses. By monitoring "pulsar_subscription_backlog_duration," we can gain real-time insights into the duration of message backlogs and promptly detect any potential processing issues within subscriptions. + +The configuration and alerting settings for this monitoring item play a crucial role in responding swiftly to issues. When "pulsar_subscription_backlog_duration" indicates an abnormal increase in duration or unusual message backlogs, system administrators receive immediate alert notifications. These alerts enable administrators to quickly identify problems and take necessary measures to prevent message losses or further delays. + +In conclusion, the introduction of the "pulsar_subscription_backlog_duration" monitoring item enables effective monitoring of subscription health, real-time issue detection, and prevention of message delivery delays and losses. Additionally, timely alerting mechanisms empower proactive responses, ensuring the reliability and efficiency of the messaging system. This is essential for providing high-quality message delivery services, ensuring user experiences, and maintaining data integrity. + +# Goals + +## In Scope + +* SubscriptionStatsImpl add this stat +* Metrics + + +## Out of Scope + +* Implementing changes to the core functionality of the Pulsar messaging system itself. +* Not include `NonPersistentTopic`. Review Comment: Because? 
########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge +Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time. + +# Motivation + +The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues. + +Maintaining the health of subscriptions is of paramount importance for the smooth operation of a messaging system. As message delivery involves interaction between producers and consumers, backlogs or unacknowledged messages can lead to data delays or losses. By monitoring "pulsar_subscription_backlog_duration," we can gain real-time insights into the duration of message backlogs and promptly detect any potential processing issues within subscriptions. + +The configuration and alerting settings for this monitoring item play a crucial role in responding swiftly to issues. When "pulsar_subscription_backlog_duration" indicates an abnormal increase in duration or unusual message backlogs, system administrators receive immediate alert notifications. These alerts enable administrators to quickly identify problems and take necessary measures to prevent message losses or further delays. 
+ +In conclusion, the introduction of the "pulsar_subscription_backlog_duration" monitoring item enables effective monitoring of subscription health, real-time issue detection, and prevention of message delivery delays and losses. Additionally, timely alerting mechanisms empower proactive responses, ensuring the reliability and efficiency of the messaging system. This is essential for providing high-quality message delivery services, ensuring user experiences, and maintaining data integrity. + +# Goals + +## In Scope + +* SubscriptionStatsImpl add this stat +* Metrics + + +## Out of Scope + +* Implementing changes to the core functionality of the Pulsar messaging system itself. +* Not include `NonPersistentTopic`. +* Not include `DelayMessage` Review Comment: Just explain why ########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge +Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time. + +# Motivation + +The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues. + +Maintaining the health of subscriptions is of paramount importance for the smooth operation of a messaging system. 
As message delivery involves interaction between producers and consumers, backlogs or unacknowledged messages can lead to data delays or losses. By monitoring "pulsar_subscription_backlog_duration," we can gain real-time insights into the duration of message backlogs and promptly detect any potential processing issues within subscriptions. + +The configuration and alerting settings for this monitoring item play a crucial role in responding swiftly to issues. When "pulsar_subscription_backlog_duration" indicates an abnormal increase in duration or unusual message backlogs, system administrators receive immediate alert notifications. These alerts enable administrators to quickly identify problems and take necessary measures to prevent message losses or further delays. + +In conclusion, the introduction of the "pulsar_subscription_backlog_duration" monitoring item enables effective monitoring of subscription health, real-time issue detection, and prevention of message delivery delays and losses. Additionally, timely alerting mechanisms empower proactive responses, ensuring the reliability and efficiency of the messaging system. This is essential for providing high-quality message delivery services, ensuring user experiences, and maintaining data integrity. + +# Goals + +## In Scope + +* SubscriptionStatsImpl add this stat +* Metrics + + +## Out of Scope + +* Implementing changes to the core functionality of the Pulsar messaging system itself. +* Not include `NonPersistentTopic`. +* Not include `DelayMessage` + +# High Level Design + +* add config `subscriptionBacklogDurationEnabled` in `broker.conf` Review Comment: You can't start with the last bit, which is only a flag for enabling this. First describe your solution: you're going to introduce a new metric, which measures the age of the backlog. It's going to be calculated as ...; you will use a flag to avoid doing that, since it is included in the metrics and it has performance issues, so not everybody would want that.
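To make the "calculated as ..." part concrete, the calculation in the PIP's Design & Implementation Details (current time minus the publish time of the entry after `markDelete`, re-read only when `markDelete` has moved, all behind the flag) could be sketched roughly as below. This is only an illustrative sketch under stated assumptions: `BacklogAgeSketch`, `publishTimeReader`, and the single `long` position are hypothetical and are not Pulsar APIs. Real positions are (ledgerId, entryId) pairs, the read involves I/O when the entry is not cached, and the empty-backlog case is ignored here.

```java
import java.util.function.LongUnaryOperator;

/** Hypothetical sketch of the proposed calculation; not Pulsar code. */
class BacklogAgeSketch {
    // Stands in for the proposed subscriptionBacklogDurationEnabled flag.
    private final boolean enabled;
    // Assumed helper: maps an entry id to its publish time in ms (the expensive read).
    private final LongUnaryOperator publishTimeReader;

    private boolean hasCache = false;
    private long cachedMarkDelete;
    private long cachedPublishTimeMs;

    BacklogAgeSketch(boolean enabled, LongUnaryOperator publishTimeReader) {
        this.enabled = enabled;
        this.publishTimeReader = publishTimeReader;
    }

    /** Returns the backlog age in ms, or -1 when the feature is disabled. */
    long backlogAgeMs(long markDeleteEntryId, long nowMs) {
        if (!enabled) {
            return -1L;
        }
        // Re-read only when markDelete moved since the last call.
        if (!hasCache || markDeleteEntryId != cachedMarkDelete) {
            cachedPublishTimeMs = publishTimeReader.applyAsLong(markDeleteEntryId + 1);
            cachedMarkDelete = markDeleteEntryId;
            hasCache = true;
        }
        // Clamp at zero in case of clock skew between producer and broker.
        return Math.max(0L, nowMs - cachedPublishTimeMs);
    }
}
```

Even with the cache, the first call after every `markDelete` move still pays for one entry read per subscription, which is the cost the flag is meant to let operators avoid.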
########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge +Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time. + +# Motivation + +The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues. + +Maintaining the health of subscriptions is of paramount importance for the smooth operation of a messaging system. As message delivery involves interaction between producers and consumers, backlogs or unacknowledged messages can lead to data delays or losses. By monitoring "pulsar_subscription_backlog_duration," we can gain real-time insights into the duration of message backlogs and promptly detect any potential processing issues within subscriptions. + +The configuration and alerting settings for this monitoring item play a crucial role in responding swiftly to issues. When "pulsar_subscription_backlog_duration" indicates an abnormal increase in duration or unusual message backlogs, system administrators receive immediate alert notifications. These alerts enable administrators to quickly identify problems and take necessary measures to prevent message losses or further delays. 
+ +In conclusion, the introduction of the "pulsar_subscription_backlog_duration" monitoring item enables effective monitoring of subscription health, real-time issue detection, and prevention of message delivery delays and losses. Additionally, timely alerting mechanisms empower proactive responses, ensuring the reliability and efficiency of the messaging system. This is essential for providing high-quality message delivery services, ensuring user experiences, and maintaining data integrity. + +# Goals + +## In Scope + +* SubscriptionStatsImpl add this stat +* Metrics + + +## Out of Scope + +* Implementing changes to the core functionality of the Pulsar messaging system itself. Review Comment: That's too obscure. What do you mean by that? ########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge +Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time. + +# Motivation + +The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues. + +Maintaining the health of subscriptions is of paramount importance for the smooth operation of a messaging system. 
As message delivery involves interaction between producers and consumers, backlogs or unacknowledged messages can lead to data delays or losses. By monitoring "pulsar_subscription_backlog_duration," we can gain real-time insights into the duration of message backlogs and promptly detect any potential processing issues within subscriptions. + +The configuration and alerting settings for this monitoring item play a crucial role in responding swiftly to issues. When "pulsar_subscription_backlog_duration" indicates an abnormal increase in duration or unusual message backlogs, system administrators receive immediate alert notifications. These alerts enable administrators to quickly identify problems and take necessary measures to prevent message losses or further delays. + +In conclusion, the introduction of the "pulsar_subscription_backlog_duration" monitoring item enables effective monitoring of subscription health, real-time issue detection, and prevention of message delivery delays and losses. Additionally, timely alerting mechanisms empower proactive responses, ensuring the reliability and efficiency of the messaging system. This is essential for providing high-quality message delivery services, ensuring user experiences, and maintaining data integrity. + +# Goals + +## In Scope + +* SubscriptionStatsImpl add this stat +* Metrics + + +## Out of Scope + +* Implementing changes to the core functionality of the Pulsar messaging system itself. +* Not include `NonPersistentTopic`. 
+* Not include `DelayMessage` + +# High Level Design + +* add config `subscriptionBacklogDurationEnabled` in `broker.conf` + * note: because we need to read the markDelete position next position message, it will consume performance when the message is not in the cache, so add this flag +* `SubscriptionStatsImpl` add `backlogDuration` variable +* `AggregatedSubscriptionStats` add `backlogDuration` variable +* add metric iterm named `pulsar_subscription_back_log_duration` + +# Detailed Design + + +## Design & Implementation Details +* when `PersistentSubscription` invoke getStats then reade the (`markDelete` + 1) entry convert to `MessageMetadata` `publish_time` to represent the `earliestUnAckMessagePublishTime` +* use currentTime - `earliestUnAckMessagePublishTime` represent `backlogDuration` +* if markDelete haven't changed, don't need to get the new `earliestUnAckMessagePublishTime`, use directly, to reduce the read entry op + +## Public-facing Changes + + +### Configuration + +* add config `subscriptionBacklogDurationEnabled` in `broker.conf` + * note: because we need to read the markDelete position next position message, it will consume performance when the message is not in the cache, so add the config + +# Backward & Forward Compatability + +## Revert + +* config `subscriptionBacklogDurationEnabled = false` in `broker.conf` +* lower the broker version + +## Upgrade + +* config `subscriptionBacklogDurationEnabled = true` in `broker.conf` + +# Alternatives + +* marDeletePosition changed every time change the `earliestUnAckMessagePublishTime`, It will be very frequent and consume performance Review Comment: You mean to say: every time markDeletePosition changes, we'll read its message and keep its timestamp in memory. When the metrics endpoint is invoked, we'll simply read it from memory. The cost of this alternative is very high, since in the worst case we'll read all messages, but backwards, which means they will likely not even be in the cache.
########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge +Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time. + +# Motivation + +The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues. + +Maintaining the health of subscriptions is of paramount importance for the smooth operation of a messaging system. As message delivery involves interaction between producers and consumers, backlogs or unacknowledged messages can lead to data delays or losses. By monitoring "pulsar_subscription_backlog_duration," we can gain real-time insights into the duration of message backlogs and promptly detect any potential processing issues within subscriptions. + +The configuration and alerting settings for this monitoring item play a crucial role in responding swiftly to issues. When "pulsar_subscription_backlog_duration" indicates an abnormal increase in duration or unusual message backlogs, system administrators receive immediate alert notifications. These alerts enable administrators to quickly identify problems and take necessary measures to prevent message losses or further delays. 
+ +In conclusion, the introduction of the "pulsar_subscription_backlog_duration" monitoring item enables effective monitoring of subscription health, real-time issue detection, and prevention of message delivery delays and losses. Additionally, timely alerting mechanisms empower proactive responses, ensuring the reliability and efficiency of the messaging system. This is essential for providing high-quality message delivery services, ensuring user experiences, and maintaining data integrity. + +# Goals + +## In Scope + +* SubscriptionStatsImpl add this stat +* Metrics + + +## Out of Scope + +* Implementing changes to the core functionality of the Pulsar messaging system itself. +* Not include `NonPersistentTopic`. +* Not include `DelayMessage` + +# High Level Design + +* add config `subscriptionBacklogDurationEnabled` in `broker.conf` + * note: because we need to read the markDelete position next position message, it will consume performance when the message is not in the cache, so add this flag +* `SubscriptionStatsImpl` add `backlogDuration` variable +* `AggregatedSubscriptionStats` add `backlogDuration` variable +* add metric iterm named `pulsar_subscription_back_log_duration` + +# Detailed Design + + +## Design & Implementation Details +* when `PersistentSubscription` invoke getStats then reade the (`markDelete` + 1) entry convert to `MessageMetadata` `publish_time` to represent the `earliestUnAckMessagePublishTime` +* use currentTime - `earliestUnAckMessagePublishTime` represent `backlogDuration` +* if markDelete haven't changed, don't need to get the new `earliestUnAckMessagePublishTime`, use directly, to reduce the read entry op + +## Public-facing Changes + + +### Configuration + +* add config `subscriptionBacklogDurationEnabled` in `broker.conf` + * note: because we need to read the markDelete position next position message, it will consume performance when the message is not in the cache, so add the config + +# Backward & Forward Compatability + +## 
Revert + +* config `subscriptionBacklogDurationEnabled = false` in `broker.conf` +* lower the broker version + +## Upgrade + +* config `subscriptionBacklogDurationEnabled = true` in `broker.conf` + +# Alternatives + +* marDeletePosition changed every time change the `earliestUnAckMessagePublishTime`, It will be very frequent and consume performance + +# General Notes +* If there are a large number of subscriptions, and markDelete postion + 1 does not exist in the cache, it may consume bookie performance Review Comment: Actually because you're reading backwards, given enough backlog, the last unack entry is most likely not in the cache ########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge +Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time. + +# Motivation + +The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues. + +Maintaining the health of subscriptions is of paramount importance for the smooth operation of a messaging system. As message delivery involves interaction between producers and consumers, backlogs or unacknowledged messages can lead to data delays or losses. 
By monitoring "pulsar_subscription_backlog_duration," we can gain real-time insights into the duration of message backlogs and promptly detect any potential processing issues within subscriptions. + +The configuration and alerting settings for this monitoring item play a crucial role in responding swiftly to issues. When "pulsar_subscription_backlog_duration" indicates an abnormal increase in duration or unusual message backlogs, system administrators receive immediate alert notifications. These alerts enable administrators to quickly identify problems and take necessary measures to prevent message losses or further delays. + +In conclusion, the introduction of the "pulsar_subscription_backlog_duration" monitoring item enables effective monitoring of subscription health, real-time issue detection, and prevention of message delivery delays and losses. Additionally, timely alerting mechanisms empower proactive responses, ensuring the reliability and efficiency of the messaging system. This is essential for providing high-quality message delivery services, ensuring user experiences, and maintaining data integrity. + +# Goals + +## In Scope + +* SubscriptionStatsImpl add this stat +* Metrics + + +## Out of Scope + +* Implementing changes to the core functionality of the Pulsar messaging system itself. +* Not include `NonPersistentTopic`. 
+* Not include `DelayMessage` + +# High Level Design + +* add config `subscriptionBacklogDurationEnabled` in `broker.conf` + * note: because we need to read the markDelete position next position message, it will consume performance when the message is not in the cache, so add this flag +* `SubscriptionStatsImpl` add `backlogDuration` variable +* `AggregatedSubscriptionStats` add `backlogDuration` variable +* add metric iterm named `pulsar_subscription_back_log_duration` + +# Detailed Design + + +## Design & Implementation Details +* when `PersistentSubscription` invoke getStats then reade the (`markDelete` + 1) entry convert to `MessageMetadata` `publish_time` to represent the `earliestUnAckMessagePublishTime` Review Comment: Your current suggestion is expensive, as you need to read the last unack message every time the metrics endpoint is invoked. It takes time due to I/O, which kind of goes against the idea that the metrics response should be fairly quick. Especially if you need to read a single message *for each* existing subscription. There's a good chance the message won't be in the cache since you're reading backwards. I have another idea, which I haven't seen described in the alternatives. How about we use an estimate of the backlog age? How? Each time we close a ledger (which should happen on a size or time threshold), we can write N (entryId, publishTimestamp) pairs to the ledger metadata. For example, if N=2, you can write the first message and last message timestamps. If you have 1000 messages in the ledger, it will look like: (1, msg1PublishTimestamp), (1000, msg1000PublishTimestamp). When the markDelete is on ledger M, we obtain those N points from M's metadata, and we can use them to estimate the publish timestamp of the current markDelete entryId. We find the two closest points and use linear interpolation to find the publish timestamp for that entryId. I guess we can POC it to see if it produces reasonable results.
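A rough sketch of the interpolation step, to make the idea easier to evaluate. Everything here is hypothetical: `PublishTimeEstimator` and its methods are illustrative names, the sample points are assumed to have already been loaded from the ledger metadata, and entry ids outside the sampled range are simply clamped to the nearest sample.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: estimates an entry's publish time from a few sampled points. */
class PublishTimeEstimator {
    // entryId -> publish timestamp (ms); assumed to come from the closed ledger's metadata.
    private final TreeMap<Long, Long> samples = new TreeMap<>();

    void addSample(long entryId, long publishTimeMs) {
        samples.put(entryId, publishTimeMs);
    }

    /** Linearly interpolates between the two sample points that surround entryId. */
    long estimate(long entryId) {
        if (samples.isEmpty()) {
            throw new IllegalStateException("no sample points available");
        }
        Map.Entry<Long, Long> lo = samples.floorEntry(entryId);
        Map.Entry<Long, Long> hi = samples.ceilingEntry(entryId);
        if (lo == null) {
            return hi.getValue(); // before the first sample: clamp
        }
        if (hi == null) {
            return lo.getValue(); // after the last sample: clamp
        }
        if (lo.getKey().equals(hi.getKey())) {
            return lo.getValue(); // entryId is itself a sample point
        }
        double fraction = (double) (entryId - lo.getKey()) / (hi.getKey() - lo.getKey());
        return lo.getValue() + Math.round(fraction * (hi.getValue() - lo.getValue()));
    }
}
```

The estimated backlog age would then be the current time minus `estimate(markDeleteEntryId)`, with no entry read at metrics time. The accuracy depends on how evenly messages were published between sample points, which is what a POC would need to measure.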
########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge +Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time. + +# Motivation + +The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues. + +Maintaining the health of subscriptions is of paramount importance for the smooth operation of a messaging system. As message delivery involves interaction between producers and consumers, backlogs or unacknowledged messages can lead to data delays or losses. By monitoring "pulsar_subscription_backlog_duration," we can gain real-time insights into the duration of message backlogs and promptly detect any potential processing issues within subscriptions. + +The configuration and alerting settings for this monitoring item play a crucial role in responding swiftly to issues. When "pulsar_subscription_backlog_duration" indicates an abnormal increase in duration or unusual message backlogs, system administrators receive immediate alert notifications. These alerts enable administrators to quickly identify problems and take necessary measures to prevent message losses or further delays. 
+ +In conclusion, the introduction of the "pulsar_subscription_backlog_duration" monitoring item enables effective monitoring of subscription health, real-time issue detection, and prevention of message delivery delays and losses. Additionally, timely alerting mechanisms empower proactive responses, ensuring the reliability and efficiency of the messaging system. This is essential for providing high-quality message delivery services, ensuring user experiences, and maintaining data integrity. + +# Goals + +## In Scope + +* SubscriptionStatsImpl add this stat +* Metrics + + +## Out of Scope + +* Implementing changes to the core functionality of the Pulsar messaging system itself. +* Not include `NonPersistentTopic`. +* Not include `DelayMessage` + +# High Level Design + +* add config `subscriptionBacklogDurationEnabled` in `broker.conf` + * note: because we need to read the markDelete position next position message, it will consume performance when the message is not in the cache, so add this flag +* `SubscriptionStatsImpl` add `backlogDuration` variable +* `AggregatedSubscriptionStats` add `backlogDuration` variable +* add metric iterm named `pulsar_subscription_back_log_duration` + +# Detailed Design + + +## Design & Implementation Details +* when `PersistentSubscription` invoke getStats then reade the (`markDelete` + 1) entry convert to `MessageMetadata` `publish_time` to represent the `earliestUnAckMessagePublishTime` +* use currentTime - `earliestUnAckMessagePublishTime` represent `backlogDuration` +* if markDelete haven't changed, don't need to get the new `earliestUnAckMessagePublishTime`, use directly, to reduce the read entry op + +## Public-facing Changes + + +### Configuration + +* add config `subscriptionBacklogDurationEnabled` in `broker.conf` + * note: because we need to read the markDelete position next position message, it will consume performance when the message is not in the cache, so add the config + +# Backward & Forward Compatability + +## 
Revert + +* config `subscriptionBacklogDurationEnabled = false` in `broker.conf` +* lower the broker version + +## Upgrade + +* config `subscriptionBacklogDurationEnabled = true` in `broker.conf` + +# Alternatives + +* marDeletePosition changed every time change the `earliestUnAckMessagePublishTime`, It will be very frequent and consume performance + +# General Notes +* If there are a large number of subscriptions, and markDelete postion + 1 does not exist in the cache, it may consume bookie performance + +# Links + +* Mailing List discussion thread: Review Comment: Don't forget to add the link to the mailing list discussion thread. ########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge +Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time. + +# Motivation + +The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues. Review Comment: I don't think "duration of backlog" makes sense in English. How about `pulsar_subscription_backlog_age`?
########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge +Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time. + +# Motivation + +The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues. + +Maintaining the health of subscriptions is of paramount importance for the smooth operation of a messaging system. As message delivery involves interaction between producers and consumers, backlogs or unacknowledged messages can lead to data delays or losses. By monitoring "pulsar_subscription_backlog_duration," we can gain real-time insights into the duration of message backlogs and promptly detect any potential processing issues within subscriptions. + +The configuration and alerting settings for this monitoring item play a crucial role in responding swiftly to issues. When "pulsar_subscription_backlog_duration" indicates an abnormal increase in duration or unusual message backlogs, system administrators receive immediate alert notifications. These alerts enable administrators to quickly identify problems and take necessary measures to prevent message losses or further delays. 
+ +In conclusion, the introduction of the "pulsar_subscription_backlog_duration" monitoring item enables effective monitoring of subscription health, real-time issue detection, and prevention of message delivery delays and losses. Additionally, timely alerting mechanisms empower proactive responses, ensuring the reliability and efficiency of the messaging system. This is essential for providing high-quality message delivery services, ensuring user experiences, and maintaining data integrity. + +# Goals + +## In Scope + +* SubscriptionStatsImpl add this stat +* Metrics + + +## Out of Scope + +* Implementing changes to the core functionality of the Pulsar messaging system itself. +* Not include `NonPersistentTopic`. +* Not include `DelayMessage` + +# High Level Design + +* add config `subscriptionBacklogDurationEnabled` in `broker.conf` + * note: because we need to read the markDelete position next position message, it will consume performance when the message is not in the cache, so add this flag +* `SubscriptionStatsImpl` add `backlogDuration` variable +* `AggregatedSubscriptionStats` add `backlogDuration` variable +* add metric iterm named `pulsar_subscription_back_log_duration` + +# Detailed Design + + +## Design & Implementation Details +* when `PersistentSubscription` invoke getStats then reade the (`markDelete` + 1) entry convert to `MessageMetadata` `publish_time` to represent the `earliestUnAckMessagePublishTime` +* use currentTime - `earliestUnAckMessagePublishTime` represent `backlogDuration` +* if markDelete haven't changed, don't need to get the new `earliestUnAckMessagePublishTime`, use directly, to reduce the read entry op + +## Public-facing Changes Review Comment: I think it makes sense for what is exposed as metrics to be consistent with the topic stats API. Today that API has an "earliestMsgPublishTimeInBacklog" argument. How about this: once it's true, we also add the age variable you have for each subscription?
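To make the computation quoted in "Design & Implementation Details" concrete, here is a minimal Java sketch: read the publish time of the entry right after the mark-delete position, subtract it from the current time, and reuse the cached publish time while the mark-delete position has not moved. `BacklogAgeTracker`, its constructor parameters, and the single `long` position are hypothetical stand-ins for illustration, not Pulsar classes. In the broker a position is a ledger ID plus an entry ID, and the entry read would presumably be asynchronous.

```java
import java.util.function.LongSupplier;
import java.util.function.LongUnaryOperator;

/**
 * Hypothetical helper (not part of Pulsar): derives the backlog age of one
 * subscription from the publish time of the earliest unacknowledged entry.
 */
final class BacklogAgeTracker {
    // Assumption: maps a position to that entry's publish time in ms.
    // It stands in for the potentially expensive entry read.
    private final LongUnaryOperator publishTimeReader;
    // Source of the current time in ms, injectable for testing.
    private final LongSupplier clock;

    // Mark-delete position the cached publish time belongs to.
    private long cachedMarkDelete = Long.MIN_VALUE;
    private long cachedPublishTime = -1L;

    BacklogAgeTracker(LongUnaryOperator publishTimeReader, LongSupplier clock) {
        this.publishTimeReader = publishTimeReader;
        this.clock = clock;
    }

    /** Returns the backlog age in ms, or 0 when there is no backlog. */
    long backlogAgeMillis(long markDelete, boolean hasBacklog) {
        if (!hasBacklog) {
            return 0L;
        }
        // Only read the entry again when the mark-delete position moved.
        if (markDelete != cachedMarkDelete) {
            cachedPublishTime = publishTimeReader.applyAsLong(markDelete + 1);
            cachedMarkDelete = markDelete;
        }
        // Clamp at zero in case the publisher clock is ahead of the local clock.
        return Math.max(0L, clock.getAsLong() - cachedPublishTime);
    }
}
```

With this shape, repeated stats calls on a stalled subscription cost one entry read in total, and a new read happens only after the mark-delete position advances.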
########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge +Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time. + +# Motivation + +The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues. + +Maintaining the health of subscriptions is of paramount importance for the smooth operation of a messaging system. As message delivery involves interaction between producers and consumers, backlogs or unacknowledged messages can lead to data delays or losses. By monitoring "pulsar_subscription_backlog_duration," we can gain real-time insights into the duration of message backlogs and promptly detect any potential processing issues within subscriptions. + +The configuration and alerting settings for this monitoring item play a crucial role in responding swiftly to issues. When "pulsar_subscription_backlog_duration" indicates an abnormal increase in duration or unusual message backlogs, system administrators receive immediate alert notifications. These alerts enable administrators to quickly identify problems and take necessary measures to prevent message losses or further delays. 
+ +In conclusion, the introduction of the "pulsar_subscription_backlog_duration" monitoring item enables effective monitoring of subscription health, real-time issue detection, and prevention of message delivery delays and losses. Additionally, timely alerting mechanisms empower proactive responses, ensuring the reliability and efficiency of the messaging system. This is essential for providing high-quality message delivery services, ensuring user experiences, and maintaining data integrity. + +# Goals + +## In Scope + +* SubscriptionStatsImpl add this stat Review Comment: No. Quote again: > What this PIP intend to achieve once It's integrated into Pulsar. > Why does it benefit Pulsar. > You took elements from the detailed design and laid it out as a goal. Your goal, in light of the problems described in the motivation are: ... ########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge +Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time. + +# Motivation + +The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues. + +Maintaining the health of subscriptions is of paramount importance for the smooth operation of a messaging system. 
As message delivery involves interaction between producers and consumers, backlogs or unacknowledged messages can lead to data delays or losses. By monitoring "pulsar_subscription_backlog_duration," we can gain real-time insights into the duration of message backlogs and promptly detect any potential processing issues within subscriptions. + +The configuration and alerting settings for this monitoring item play a crucial role in responding swiftly to issues. When "pulsar_subscription_backlog_duration" indicates an abnormal increase in duration or unusual message backlogs, system administrators receive immediate alert notifications. These alerts enable administrators to quickly identify problems and take necessary measures to prevent message losses or further delays. + +In conclusion, the introduction of the "pulsar_subscription_backlog_duration" monitoring item enables effective monitoring of subscription health, real-time issue detection, and prevention of message delivery delays and losses. Additionally, timely alerting mechanisms empower proactive responses, ensuring the reliability and efficiency of the messaging system. This is essential for providing high-quality message delivery services, ensuring user experiences, and maintaining data integrity. + +# Goals + +## In Scope + +* SubscriptionStatsImpl add this stat +* Metrics + + +## Out of Scope + +* Implementing changes to the core functionality of the Pulsar messaging system itself. +* Not include `NonPersistentTopic`. +* Not include `DelayMessage` + +# High Level Design + +* add config `subscriptionBacklogDurationEnabled` in `broker.conf` + * note: because we need to read the markDelete position next position message, it will consume performance when the message is not in the cache, so add this flag +* `SubscriptionStatsImpl` add `backlogDuration` variable Review Comment: No no. This is detailed design. Class names and variable names belong in the detailed design.
Here, just describe the solution at a high level. ########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge +Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time. + +# Motivation + +The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues. + +Maintaining the health of subscriptions is of paramount importance for the smooth operation of a messaging system. As message delivery involves interaction between producers and consumers, backlogs or unacknowledged messages can lead to data delays or losses. By monitoring "pulsar_subscription_backlog_duration," we can gain real-time insights into the duration of message backlogs and promptly detect any potential processing issues within subscriptions. + +The configuration and alerting settings for this monitoring item play a crucial role in responding swiftly to issues. When "pulsar_subscription_backlog_duration" indicates an abnormal increase in duration or unusual message backlogs, system administrators receive immediate alert notifications. These alerts enable administrators to quickly identify problems and take necessary measures to prevent message losses or further delays.
+ +In conclusion, the introduction of the "pulsar_subscription_backlog_duration" monitoring item enables effective monitoring of subscription health, real-time issue detection, and prevention of message delivery delays and losses. Additionally, timely alerting mechanisms empower proactive responses, ensuring the reliability and efficiency of the messaging system. This is essential for providing high-quality message delivery services, ensuring user experiences, and maintaining data integrity. + +# Goals + +## In Scope + +* SubscriptionStatsImpl add this stat +* Metrics + + +## Out of Scope + +* Implementing changes to the core functionality of the Pulsar messaging system itself. +* Not include `NonPersistentTopic`. +* Not include `DelayMessage` + +# High Level Design + +* add config `subscriptionBacklogDurationEnabled` in `broker.conf` + * note: because we need to read the markDelete position next position message, it will consume performance when the message is not in the cache, so add this flag +* `SubscriptionStatsImpl` add `backlogDuration` variable +* `AggregatedSubscriptionStats` add `backlogDuration` variable +* add metric iterm named `pulsar_subscription_back_log_duration` + +# Detailed Design + + +## Design & Implementation Details +* when `PersistentSubscription` invoke getStats then reade the (`markDelete` + 1) entry convert to `MessageMetadata` `publish_time` to represent the `earliestUnAckMessagePublishTime` +* use currentTime - `earliestUnAckMessagePublishTime` represent `backlogDuration` +* if markDelete haven't changed, don't need to get the new `earliestUnAckMessagePublishTime`, use directly, to reduce the read entry op + +## Public-facing Changes + + +### Configuration + +* add config `subscriptionBacklogDurationEnabled` in `broker.conf` Review Comment: Before naming a configuration, you must think about the user experience. First, search for existing metrics-related configuration names so the new one is consistent with them.
If you do, you will see that many are of the form "expose*InPrometheus". ########## pip/pip-285.md: ########## @@ -0,0 +1,73 @@ +# Background knowledge +Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time. + +# Motivation + +The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues. Review Comment: Here you expose the subscription backlog age. We have a similar concept for backlog size: a subscription has a backlog size, yet we also expose backlog size at the topic level. Maybe we should have a metric showing the max() of this age across a topic's subscriptions, e.g. pulsar_topic_max_subscription_age? Just an idea.
