asafm commented on code in PR #20859:
URL: https://github.com/apache/pulsar/pull/20859#discussion_r1273558633


##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge

Review Comment:
   I would like to quote the description of this section from the template:
   
   > Describes all the knowledge you need to know in order to understand all 
the other sections in this PIP
   > 
   > * Give a high level explanation on all concepts you will be using 
throughout this document. For example, if you want to talk about Persistent 
Subscriptions, explain briefly (1 paragraph) what this is. If you're going to 
talk about Transaction Buffer, explain briefly what this is. 
   >   If you're going to change something specific, then go into more detail 
about it and how it works. 
   > * Provide links where possible if a person wants to dig deeper into the 
background information. 
   > 
   
   What you wrote is the motivation: you described the problem users face
   today. That belongs in the Motivation section.
   
   How do you do that?
   When you finish writing your design document, go over it and list every
   concept you used that the reader needs to know beforehand in order to
   understand the doc. Once you have that list, briefly describe each one.
   In your case it would be:
   * backlog
   * subscription
   * Delayed messages (which you have excluded)
   * and more
   
   A good example, which matches your topic perfectly, is #19601



##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" 
and "pulsar_subscription_back_log_no_delayed," provide valuable insights into 
the quantity of backlogged messages. However, they lack a metric that directly 
measures the duration of message backlog. Monitoring the duration of backlog is 
vital as it allows us to understand the persistence of message accumulation 
within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item 
"pulsar_subscription_backlog_duration" is to effectively monitor the health of 
subscriptions within the Pulsar messaging system. This health metric represents 
whether there are messages that have not been successfully acknowledged (ACKed) 
and potential consumer-side issues. Additionally, this monitoring item allows 
us to configure alerting mechanisms, ensuring timely notifications to users, 
thereby facilitating proactive response to potential issues.

Review Comment:
   Let me quote the template for what should be written under motivation:
   
   > Describe the problem this proposal is trying to solve.
   > 
   > * Explain what is the problem you're trying to solve - current situation.
   > * This section is the "Why" of your proposal.
   > 
   
   Instead, you explained why you want to add this duration metric. The
   solution (the duration metric in your case) should be described in the High
   Level Design section, where you explain how it solves the problem described
   in the motivation. You can't refer to the solution in the motivation
   section, as you did.
   
   Just focus on describing the problem at hand. 
   
   Also, very important: you've used too many words that repeat the same idea
   over and over and obscure the meaning. You can say exactly what you said in
   this section with half the words.



##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" 
and "pulsar_subscription_back_log_no_delayed," provide valuable insights into 
the quantity of backlogged messages. However, they lack a metric that directly 
measures the duration of message backlog. Monitoring the duration of backlog is 
vital as it allows us to understand the persistence of message accumulation 
within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item 
"pulsar_subscription_backlog_duration" is to effectively monitor the health of 
subscriptions within the Pulsar messaging system. This health metric represents 
whether there are messages that have not been successfully acknowledged (ACKed) 
and potential consumer-side issues. Additionally, this monitoring item allows 
us to configure alerting mechanisms, ensuring timely notifications to users, 
thereby facilitating proactive response to potential issues.
+
+Maintaining the health of subscriptions is of paramount importance for the 
smooth operation of a messaging system. As message delivery involves 
interaction between producers and consumers, backlogs or unacknowledged 
messages can lead to data delays or losses. By monitoring 
"pulsar_subscription_backlog_duration," we can gain real-time insights into the 
duration of message backlogs and promptly detect any potential processing 
issues within subscriptions.
+
+The configuration and alerting settings for this monitoring item play a 
crucial role in responding swiftly to issues. When 
"pulsar_subscription_backlog_duration" indicates an abnormal increase in 
duration or unusual message backlogs, system administrators receive immediate 
alert notifications. These alerts enable administrators to quickly identify 
problems and take necessary measures to prevent message losses or further 
delays.
+
+In conclusion, the introduction of the "pulsar_subscription_backlog_duration" 
monitoring item enables effective monitoring of subscription health, real-time 
issue detection, and prevention of message delivery delays and losses. 
Additionally, timely alerting mechanisms empower proactive responses, ensuring 
the reliability and efficiency of the messaging system. This is essential for 
providing high-quality message delivery services, ensuring user experiences, 
and maintaining data integrity.
+
+# Goals
+
+## In Scope
+
+* SubscriptionStatsImpl add this stat
+* Metrics
+
+
+## Out of Scope
+
+* Implementing changes to the core functionality of the Pulsar messaging 
system itself.
+* Not include `NonPersistentTopic`.

Review Comment:
   Because?



##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" 
and "pulsar_subscription_back_log_no_delayed," provide valuable insights into 
the quantity of backlogged messages. However, they lack a metric that directly 
measures the duration of message backlog. Monitoring the duration of backlog is 
vital as it allows us to understand the persistence of message accumulation 
within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item 
"pulsar_subscription_backlog_duration" is to effectively monitor the health of 
subscriptions within the Pulsar messaging system. This health metric represents 
whether there are messages that have not been successfully acknowledged (ACKed) 
and potential consumer-side issues. Additionally, this monitoring item allows 
us to configure alerting mechanisms, ensuring timely notifications to users, 
thereby facilitating proactive response to potential issues.
+
+Maintaining the health of subscriptions is of paramount importance for the 
smooth operation of a messaging system. As message delivery involves 
interaction between producers and consumers, backlogs or unacknowledged 
messages can lead to data delays or losses. By monitoring 
"pulsar_subscription_backlog_duration," we can gain real-time insights into the 
duration of message backlogs and promptly detect any potential processing 
issues within subscriptions.
+
+The configuration and alerting settings for this monitoring item play a 
crucial role in responding swiftly to issues. When 
"pulsar_subscription_backlog_duration" indicates an abnormal increase in 
duration or unusual message backlogs, system administrators receive immediate 
alert notifications. These alerts enable administrators to quickly identify 
problems and take necessary measures to prevent message losses or further 
delays.
+
+In conclusion, the introduction of the "pulsar_subscription_backlog_duration" 
monitoring item enables effective monitoring of subscription health, real-time 
issue detection, and prevention of message delivery delays and losses. 
Additionally, timely alerting mechanisms empower proactive responses, ensuring 
the reliability and efficiency of the messaging system. This is essential for 
providing high-quality message delivery services, ensuring user experiences, 
and maintaining data integrity.
+
+# Goals
+
+## In Scope
+
+* SubscriptionStatsImpl add this stat
+* Metrics
+
+
+## Out of Scope
+
+* Implementing changes to the core functionality of the Pulsar messaging 
system itself.
+* Not include `NonPersistentTopic`.
+* Not include `DelayMessage`

Review Comment:
   Just explain why



##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" 
and "pulsar_subscription_back_log_no_delayed," provide valuable insights into 
the quantity of backlogged messages. However, they lack a metric that directly 
measures the duration of message backlog. Monitoring the duration of backlog is 
vital as it allows us to understand the persistence of message accumulation 
within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item 
"pulsar_subscription_backlog_duration" is to effectively monitor the health of 
subscriptions within the Pulsar messaging system. This health metric represents 
whether there are messages that have not been successfully acknowledged (ACKed) 
and potential consumer-side issues. Additionally, this monitoring item allows 
us to configure alerting mechanisms, ensuring timely notifications to users, 
thereby facilitating proactive response to potential issues.
+
+Maintaining the health of subscriptions is of paramount importance for the 
smooth operation of a messaging system. As message delivery involves 
interaction between producers and consumers, backlogs or unacknowledged 
messages can lead to data delays or losses. By monitoring 
"pulsar_subscription_backlog_duration," we can gain real-time insights into the 
duration of message backlogs and promptly detect any potential processing 
issues within subscriptions.
+
+The configuration and alerting settings for this monitoring item play a 
crucial role in responding swiftly to issues. When 
"pulsar_subscription_backlog_duration" indicates an abnormal increase in 
duration or unusual message backlogs, system administrators receive immediate 
alert notifications. These alerts enable administrators to quickly identify 
problems and take necessary measures to prevent message losses or further 
delays.
+
+In conclusion, the introduction of the "pulsar_subscription_backlog_duration" 
monitoring item enables effective monitoring of subscription health, real-time 
issue detection, and prevention of message delivery delays and losses. 
Additionally, timely alerting mechanisms empower proactive responses, ensuring 
the reliability and efficiency of the messaging system. This is essential for 
providing high-quality message delivery services, ensuring user experiences, 
and maintaining data integrity.
+
+# Goals
+
+## In Scope
+
+* SubscriptionStatsImpl add this stat
+* Metrics
+
+
+## Out of Scope
+
+* Implementing changes to the core functionality of the Pulsar messaging 
system itself.
+* Not include `NonPersistentTopic`.
+* Not include `DelayMessage`
+
+# High Level Design
+
+* add config `subscriptionBacklogDurationEnabled` in `broker.conf`

Review Comment:
   You can't start with the last bit, which is only a flag for enabling this.
   First describe your solution: you're going to introduce a new metric, which
   measures the age. It's going to be calculated as ...; you will add a flag
   to disable it, since it is included in the metrics and it has performance
   issues, so not everybody would want it.
   



##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" 
and "pulsar_subscription_back_log_no_delayed," provide valuable insights into 
the quantity of backlogged messages. However, they lack a metric that directly 
measures the duration of message backlog. Monitoring the duration of backlog is 
vital as it allows us to understand the persistence of message accumulation 
within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item 
"pulsar_subscription_backlog_duration" is to effectively monitor the health of 
subscriptions within the Pulsar messaging system. This health metric represents 
whether there are messages that have not been successfully acknowledged (ACKed) 
and potential consumer-side issues. Additionally, this monitoring item allows 
us to configure alerting mechanisms, ensuring timely notifications to users, 
thereby facilitating proactive response to potential issues.
+
+Maintaining the health of subscriptions is of paramount importance for the 
smooth operation of a messaging system. As message delivery involves 
interaction between producers and consumers, backlogs or unacknowledged 
messages can lead to data delays or losses. By monitoring 
"pulsar_subscription_backlog_duration," we can gain real-time insights into the 
duration of message backlogs and promptly detect any potential processing 
issues within subscriptions.
+
+The configuration and alerting settings for this monitoring item play a 
crucial role in responding swiftly to issues. When 
"pulsar_subscription_backlog_duration" indicates an abnormal increase in 
duration or unusual message backlogs, system administrators receive immediate 
alert notifications. These alerts enable administrators to quickly identify 
problems and take necessary measures to prevent message losses or further 
delays.
+
+In conclusion, the introduction of the "pulsar_subscription_backlog_duration" 
monitoring item enables effective monitoring of subscription health, real-time 
issue detection, and prevention of message delivery delays and losses. 
Additionally, timely alerting mechanisms empower proactive responses, ensuring 
the reliability and efficiency of the messaging system. This is essential for 
providing high-quality message delivery services, ensuring user experiences, 
and maintaining data integrity.
+
+# Goals
+
+## In Scope
+
+* SubscriptionStatsImpl add this stat
+* Metrics
+
+
+## Out of Scope
+
+* Implementing changes to the core functionality of the Pulsar messaging 
system itself.

Review Comment:
   That's too obscure. What do you mean by that?



##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" 
and "pulsar_subscription_back_log_no_delayed," provide valuable insights into 
the quantity of backlogged messages. However, they lack a metric that directly 
measures the duration of message backlog. Monitoring the duration of backlog is 
vital as it allows us to understand the persistence of message accumulation 
within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item 
"pulsar_subscription_backlog_duration" is to effectively monitor the health of 
subscriptions within the Pulsar messaging system. This health metric represents 
whether there are messages that have not been successfully acknowledged (ACKed) 
and potential consumer-side issues. Additionally, this monitoring item allows 
us to configure alerting mechanisms, ensuring timely notifications to users, 
thereby facilitating proactive response to potential issues.
+
+Maintaining the health of subscriptions is of paramount importance for the 
smooth operation of a messaging system. As message delivery involves 
interaction between producers and consumers, backlogs or unacknowledged 
messages can lead to data delays or losses. By monitoring 
"pulsar_subscription_backlog_duration," we can gain real-time insights into the 
duration of message backlogs and promptly detect any potential processing 
issues within subscriptions.
+
+The configuration and alerting settings for this monitoring item play a 
crucial role in responding swiftly to issues. When 
"pulsar_subscription_backlog_duration" indicates an abnormal increase in 
duration or unusual message backlogs, system administrators receive immediate 
alert notifications. These alerts enable administrators to quickly identify 
problems and take necessary measures to prevent message losses or further 
delays.
+
+In conclusion, the introduction of the "pulsar_subscription_backlog_duration" 
monitoring item enables effective monitoring of subscription health, real-time 
issue detection, and prevention of message delivery delays and losses. 
Additionally, timely alerting mechanisms empower proactive responses, ensuring 
the reliability and efficiency of the messaging system. This is essential for 
providing high-quality message delivery services, ensuring user experiences, 
and maintaining data integrity.
+
+# Goals
+
+## In Scope
+
+* SubscriptionStatsImpl add this stat
+* Metrics
+
+
+## Out of Scope
+
+* Implementing changes to the core functionality of the Pulsar messaging 
system itself.
+* Not include `NonPersistentTopic`.
+* Not include `DelayMessage`
+
+# High Level Design
+
+* add config `subscriptionBacklogDurationEnabled` in `broker.conf`
+  * note: because we need to read the markDelete position next position 
message, it will consume performance when the message is not in the cache, so 
add this flag
+* `SubscriptionStatsImpl` add `backlogDuration` variable
+* `AggregatedSubscriptionStats` add `backlogDuration` variable
+* add metric iterm named `pulsar_subscription_back_log_duration`
+
+# Detailed Design
+
+
+## Design & Implementation Details
+* when `PersistentSubscription` invoke getStats then reade the (`markDelete` + 
1) entry convert to `MessageMetadata` `publish_time` to represent the 
`earliestUnAckMessagePublishTime`
+* use currentTime - `earliestUnAckMessagePublishTime` represent 
`backlogDuration`
+* if markDelete haven't changed, don't need to get the new 
`earliestUnAckMessagePublishTime`, use directly, to reduce the read entry op
+
+## Public-facing Changes
+
+
+### Configuration
+
+* add config `subscriptionBacklogDurationEnabled` in `broker.conf`
+  * note: because we need to read the markDelete position next position 
message, it will consume performance when the message is not in the cache, so 
add the config
+
+# Backward & Forward Compatability
+
+## Revert
+
+* config `subscriptionBacklogDurationEnabled = false` in `broker.conf`
+* lower the broker version
+
+## Upgrade
+
+* config `subscriptionBacklogDurationEnabled = true` in `broker.conf`
+
+# Alternatives
+
+* marDeletePosition changed every time change the 
`earliestUnAckMessagePublishTime`, It will be very frequent and consume 
performance

Review Comment:
   You mean to say:
   Every time markDeletePosition changes, we'll read its message and keep its
   timestamp in memory. When the metrics endpoint is invoked, we'll simply read
   it from memory. The cost of this alternative is very high, since in the
   worst case we'll read all messages, and because we read backwards they will
   likely not even be in the cache.
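   
   A rough sketch of that alternative, to make the cost visible. Everything
   here is hypothetical: the class name, the `readPublishTimeMs` callback that
   stands in for an entry read, and the single-ledger entryId arithmetic. It
   follows the PIP in reading the entry at markDelete + 1.

```java
import java.util.function.LongUnaryOperator;

/** Hypothetical sketch: refresh the cached publish time on every markDelete move. */
public class EagerBacklogTimestampCache {
    // Stand-in for an entry read (cache or bookie): entryId -> publish time in ms.
    private final LongUnaryOperator readPublishTimeMs;
    private long cachedPublishTimeMs = -1L; // -1 means nothing cached yet
    private long entryReads = 0L;           // counts how many reads were triggered

    public EagerBacklogTimestampCache(LongUnaryOperator readPublishTimeMs) {
        this.readPublishTimeMs = readPublishTimeMs;
    }

    /** Called on every markDelete change; this is where the read cost is paid. */
    public synchronized void onMarkDeleteAdvanced(long markDeleteEntryId) {
        cachedPublishTimeMs = readPublishTimeMs.applyAsLong(markDeleteEntryId + 1);
        entryReads++;
    }

    /** Called by the metrics endpoint; only touches memory. */
    public synchronized long backlogAgeMs(long nowMs) {
        return cachedPublishTimeMs < 0 ? 0L : Math.max(0L, nowMs - cachedPublishTimeMs);
    }

    public synchronized long entryReads() {
        return entryReads;
    }
}
```

   The metrics path is cheap, but `entryReads` grows with every acknowledgment
   that moves markDelete, which is the cost described above.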
   



##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" 
and "pulsar_subscription_back_log_no_delayed," provide valuable insights into 
the quantity of backlogged messages. However, they lack a metric that directly 
measures the duration of message backlog. Monitoring the duration of backlog is 
vital as it allows us to understand the persistence of message accumulation 
within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item 
"pulsar_subscription_backlog_duration" is to effectively monitor the health of 
subscriptions within the Pulsar messaging system. This health metric represents 
whether there are messages that have not been successfully acknowledged (ACKed) 
and potential consumer-side issues. Additionally, this monitoring item allows 
us to configure alerting mechanisms, ensuring timely notifications to users, 
thereby facilitating proactive response to potential issues.
+
+Maintaining the health of subscriptions is of paramount importance for the 
smooth operation of a messaging system. As message delivery involves 
interaction between producers and consumers, backlogs or unacknowledged 
messages can lead to data delays or losses. By monitoring 
"pulsar_subscription_backlog_duration," we can gain real-time insights into the 
duration of message backlogs and promptly detect any potential processing 
issues within subscriptions.
+
+The configuration and alerting settings for this monitoring item play a 
crucial role in responding swiftly to issues. When 
"pulsar_subscription_backlog_duration" indicates an abnormal increase in 
duration or unusual message backlogs, system administrators receive immediate 
alert notifications. These alerts enable administrators to quickly identify 
problems and take necessary measures to prevent message losses or further 
delays.
+
+In conclusion, the introduction of the "pulsar_subscription_backlog_duration" 
monitoring item enables effective monitoring of subscription health, real-time 
issue detection, and prevention of message delivery delays and losses. 
Additionally, timely alerting mechanisms empower proactive responses, ensuring 
the reliability and efficiency of the messaging system. This is essential for 
providing high-quality message delivery services, ensuring user experiences, 
and maintaining data integrity.
+
+# Goals
+
+## In Scope
+
+* SubscriptionStatsImpl add this stat
+* Metrics
+
+
+## Out of Scope
+
+* Implementing changes to the core functionality of the Pulsar messaging 
system itself.
+* Not include `NonPersistentTopic`.
+* Not include `DelayMessage`
+
+# High Level Design
+
+* add config `subscriptionBacklogDurationEnabled` in `broker.conf`
+  * note: because we need to read the markDelete position next position 
message, it will consume performance when the message is not in the cache, so 
add this flag
+* `SubscriptionStatsImpl` add `backlogDuration` variable
+* `AggregatedSubscriptionStats` add `backlogDuration` variable
+* add metric iterm named `pulsar_subscription_back_log_duration`
+
+# Detailed Design
+
+
+## Design & Implementation Details
+* when `PersistentSubscription` invoke getStats then reade the (`markDelete` + 
1) entry convert to `MessageMetadata` `publish_time` to represent the 
`earliestUnAckMessagePublishTime`
+* use currentTime - `earliestUnAckMessagePublishTime` represent 
`backlogDuration`
+* if markDelete haven't changed, don't need to get the new 
`earliestUnAckMessagePublishTime`, use directly, to reduce the read entry op
+
+## Public-facing Changes
+
+
+### Configuration
+
+* add config `subscriptionBacklogDurationEnabled` in `broker.conf`
+  * note: because we need to read the markDelete position next position 
message, it will consume performance when the message is not in the cache, so 
add the config
+
+# Backward & Forward Compatability
+
+## Revert
+
+* config `subscriptionBacklogDurationEnabled = false` in `broker.conf`
+* lower the broker version
+
+## Upgrade
+
+* config `subscriptionBacklogDurationEnabled = true` in `broker.conf`
+
+# Alternatives
+
+* marDeletePosition changed every time change the 
`earliestUnAckMessagePublishTime`, It will be very frequent and consume 
performance
+
+# General Notes
+* If there are a large number of subscriptions, and markDelete postion + 1 
does not exist in the cache, it may consume bookie performance

Review Comment:
   Actually, because you're reading backwards, given enough backlog the last
   unacked entry is most likely not in the cache.



##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" 
and "pulsar_subscription_back_log_no_delayed," provide valuable insights into 
the quantity of backlogged messages. However, they lack a metric that directly 
measures the duration of message backlog. Monitoring the duration of backlog is 
vital as it allows us to understand the persistence of message accumulation 
within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item 
"pulsar_subscription_backlog_duration" is to effectively monitor the health of 
subscriptions within the Pulsar messaging system. This health metric represents 
whether there are messages that have not been successfully acknowledged (ACKed) 
and potential consumer-side issues. Additionally, this monitoring item allows 
us to configure alerting mechanisms, ensuring timely notifications to users, 
thereby facilitating proactive response to potential issues.
+
+Maintaining the health of subscriptions is of paramount importance for the 
smooth operation of a messaging system. As message delivery involves 
interaction between producers and consumers, backlogs or unacknowledged 
messages can lead to data delays or losses. By monitoring 
"pulsar_subscription_backlog_duration," we can gain real-time insights into the 
duration of message backlogs and promptly detect any potential processing 
issues within subscriptions.
+
+The configuration and alerting settings for this monitoring item play a 
crucial role in responding swiftly to issues. When 
"pulsar_subscription_backlog_duration" indicates an abnormal increase in 
duration or unusual message backlogs, system administrators receive immediate 
alert notifications. These alerts enable administrators to quickly identify 
problems and take necessary measures to prevent message losses or further 
delays.
+
+In conclusion, the introduction of the "pulsar_subscription_backlog_duration" 
monitoring item enables effective monitoring of subscription health, real-time 
issue detection, and prevention of message delivery delays and losses. 
Additionally, timely alerting mechanisms empower proactive responses, ensuring 
the reliability and efficiency of the messaging system. This is essential for 
providing high-quality message delivery services, ensuring user experiences, 
and maintaining data integrity.
+
+# Goals
+
+## In Scope
+
+* SubscriptionStatsImpl add this stat
+* Metrics
+
+
+## Out of Scope
+
+* Implementing changes to the core functionality of the Pulsar messaging 
system itself.
+* Not include `NonPersistentTopic`.
+* Not include `DelayMessage`
+
+# High Level Design
+
+* add config `subscriptionBacklogDurationEnabled` in `broker.conf`
+  * note: because we need to read the markDelete position next position 
message, it will consume performance when the message is not in the cache, so 
add this flag
+* `SubscriptionStatsImpl` add `backlogDuration` variable
+* `AggregatedSubscriptionStats` add `backlogDuration` variable
+* add metric iterm named `pulsar_subscription_back_log_duration`
+
+# Detailed Design
+
+
+## Design & Implementation Details
+* when `PersistentSubscription` invoke getStats then reade the (`markDelete` + 
1) entry convert to `MessageMetadata` `publish_time` to represent the 
`earliestUnAckMessagePublishTime`

Review Comment:
   Your current suggestion is expensive, as you need to read the last unacked
   message every time the metrics endpoint is invoked. It takes time due to
   I/O, which goes against the idea that a metrics response should be fairly
   quick, especially if you need to read a single message *for each* existing
   subscription.
   There's a good chance the message won't be in the cache, since you're
   reading backwards.
   
   I have another idea, which I haven't seen described in the alternatives.
   How about we estimate the backlog age?
   How?
   Each time we close a ledger (which should happen once a size or time limit
   is reached), we can write N (entryId, publishTimestamp) pairs to the ledger
   metadata. For example, if N=2, you can write the first and last message
   timestamps. If you have 1000 messages in the ledger, it will look like:
   (1, msg1PublishTimestamp), (1000, msg1000PublishTimestamp).
   
   When the markDelete is on ledger M, we obtain those N points from M's
   metadata and use them to estimate the publish timestamp of the current
   markDelete entryId: we find the two closest points and apply linear
   interpolation to get the publish timestamp for that entryId.
   
   I guess we can POC it to see if it produces reasonable results.
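   
   A minimal sketch of the interpolation step, as a starting point for such a
   POC. The `BacklogAgeEstimator` and `Sample` names are made up for
   illustration, and it assumes the N pairs have already been loaded from the
   ledger metadata and are sorted by entryId; how they get written there is
   not shown.

```java
import java.util.List;

/** Hypothetical sketch: estimate a publish time from sampled (entryId, publishTime) pairs. */
public class BacklogAgeEstimator {

    /** One sampled point, assumed to come from the ledger metadata. */
    public static final class Sample {
        final long entryId;
        final long publishTimeMs;

        public Sample(long entryId, long publishTimeMs) {
            this.entryId = entryId;
            this.publishTimeMs = publishTimeMs;
        }
    }

    /** Linear interpolation between the two samples that surround entryId. */
    public static long estimatePublishTimeMs(List<Sample> samples, long entryId) {
        if (samples.isEmpty()) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        Sample first = samples.get(0);
        Sample last = samples.get(samples.size() - 1);
        // Outside the sampled range: clamp to the nearest known point.
        if (entryId <= first.entryId) {
            return first.publishTimeMs;
        }
        if (entryId >= last.entryId) {
            return last.publishTimeMs;
        }
        for (int i = 1; i < samples.size(); i++) {
            Sample lo = samples.get(i - 1);
            Sample hi = samples.get(i);
            if (entryId <= hi.entryId) {
                if (hi.entryId == lo.entryId) {
                    return lo.publishTimeMs; // duplicate sample, nothing to interpolate
                }
                double fraction = (double) (entryId - lo.entryId) / (hi.entryId - lo.entryId);
                return lo.publishTimeMs + Math.round(fraction * (hi.publishTimeMs - lo.publishTimeMs));
            }
        }
        return last.publishTimeMs; // not reached for sorted input
    }

    /** Estimated backlog age: now minus the estimated publish time at markDelete. */
    public static long estimateBacklogAgeMs(List<Sample> samples, long markDeleteEntryId, long nowMs) {
        return Math.max(0L, nowMs - estimatePublishTimeMs(samples, markDeleteEntryId));
    }
}
```

   The estimate is only as good as the assumption that publish rate is roughly
   constant between two sampled points, so a larger N should tighten it at the
   cost of more metadata.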
   



##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" 
and "pulsar_subscription_back_log_no_delayed," provide valuable insights into 
the quantity of backlogged messages. However, they lack a metric that directly 
measures the duration of message backlog. Monitoring the duration of backlog is 
vital as it allows us to understand the persistence of message accumulation 
within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item 
"pulsar_subscription_backlog_duration" is to effectively monitor the health of 
subscriptions within the Pulsar messaging system. This health metric represents 
whether there are messages that have not been successfully acknowledged (ACKed) 
and potential consumer-side issues. Additionally, this monitoring item allows 
us to configure alerting mechanisms, ensuring timely notifications to users, 
thereby facilitating proactive response to potential issues.
+
+Maintaining the health of subscriptions is of paramount importance for the 
smooth operation of a messaging system. As message delivery involves 
interaction between producers and consumers, backlogs or unacknowledged 
messages can lead to data delays or losses. By monitoring 
"pulsar_subscription_backlog_duration," we can gain real-time insights into the 
duration of message backlogs and promptly detect any potential processing 
issues within subscriptions.
+
+The configuration and alerting settings for this monitoring item play a 
crucial role in responding swiftly to issues. When 
"pulsar_subscription_backlog_duration" indicates an abnormal increase in 
duration or unusual message backlogs, system administrators receive immediate 
alert notifications. These alerts enable administrators to quickly identify 
problems and take necessary measures to prevent message losses or further 
delays.
+
+In conclusion, the introduction of the "pulsar_subscription_backlog_duration" 
monitoring item enables effective monitoring of subscription health, real-time 
issue detection, and prevention of message delivery delays and losses. 
Additionally, timely alerting mechanisms empower proactive responses, ensuring 
the reliability and efficiency of the messaging system. This is essential for 
providing high-quality message delivery services, ensuring user experiences, 
and maintaining data integrity.
+
+# Goals
+
+## In Scope
+
+* SubscriptionStatsImpl add this stat
+* Metrics
+
+
+## Out of Scope
+
+* Implementing changes to the core functionality of the Pulsar messaging 
system itself.
+* Not include `NonPersistentTopic`.
+* Not include `DelayMessage`
+
+# High Level Design
+
+* add config `subscriptionBacklogDurationEnabled` in `broker.conf`
+  * note: because we need to read the markDelete position next position 
message, it will consume performance when the message is not in the cache, so 
add this flag
+* `SubscriptionStatsImpl` add `backlogDuration` variable
+* `AggregatedSubscriptionStats` add `backlogDuration` variable
+* add metric iterm named `pulsar_subscription_back_log_duration`
+
+# Detailed Design
+
+
+## Design & Implementation Details
+* when `PersistentSubscription` invoke getStats then reade the (`markDelete` + 
1) entry convert to `MessageMetadata` `publish_time` to represent the 
`earliestUnAckMessagePublishTime`
+* use currentTime - `earliestUnAckMessagePublishTime` represent 
`backlogDuration`
+* if markDelete haven't changed, don't need to get the new 
`earliestUnAckMessagePublishTime`, use directly, to reduce the read entry op
+
+## Public-facing Changes
+
+
+### Configuration
+
+* add config `subscriptionBacklogDurationEnabled` in `broker.conf`
+  * note: because we need to read the markDelete position next position 
message, it will consume performance when the message is not in the cache, so 
add the config
+
+# Backward & Forward Compatability
+
+## Revert
+
+* config `subscriptionBacklogDurationEnabled = false` in `broker.conf`
+* lower the broker version
+
+## Upgrade
+
+* config `subscriptionBacklogDurationEnabled = true` in `broker.conf`
+
+# Alternatives
+
+* marDeletePosition changed every time change the 
`earliestUnAckMessagePublishTime`, It will be very frequent and consume 
performance
+
+# General Notes
+* If there are a large number of subscriptions, and markDelete postion + 1 
does not exist in the cache, it may consume bookie performance
+
+# Links
+
+* Mailing List discussion thread:

Review Comment:
   Don't forget to fill in this link.



##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" 
and "pulsar_subscription_back_log_no_delayed," provide valuable insights into 
the quantity of backlogged messages. However, they lack a metric that directly 
measures the duration of message backlog. Monitoring the duration of backlog is 
vital as it allows us to understand the persistence of message accumulation 
within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item 
"pulsar_subscription_backlog_duration" is to effectively monitor the health of 
subscriptions within the Pulsar messaging system. This health metric represents 
whether there are messages that have not been successfully acknowledged (ACKed) 
and potential consumer-side issues. Additionally, this monitoring item allows 
us to configure alerting mechanisms, ensuring timely notifications to users, 
thereby facilitating proactive response to potential issues.

Review Comment:
   I think "duration of backlog" doesn't read well in English.
   How about `pulsar_subscription_backlog_age`?
   
   



##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" 
and "pulsar_subscription_back_log_no_delayed," provide valuable insights into 
the quantity of backlogged messages. However, they lack a metric that directly 
measures the duration of message backlog. Monitoring the duration of backlog is 
vital as it allows us to understand the persistence of message accumulation 
within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item 
"pulsar_subscription_backlog_duration" is to effectively monitor the health of 
subscriptions within the Pulsar messaging system. This health metric represents 
whether there are messages that have not been successfully acknowledged (ACKed) 
and potential consumer-side issues. Additionally, this monitoring item allows 
us to configure alerting mechanisms, ensuring timely notifications to users, 
thereby facilitating proactive response to potential issues.
+
+Maintaining the health of subscriptions is of paramount importance for the 
smooth operation of a messaging system. As message delivery involves 
interaction between producers and consumers, backlogs or unacknowledged 
messages can lead to data delays or losses. By monitoring 
"pulsar_subscription_backlog_duration," we can gain real-time insights into the 
duration of message backlogs and promptly detect any potential processing 
issues within subscriptions.
+
+The configuration and alerting settings for this monitoring item play a 
crucial role in responding swiftly to issues. When 
"pulsar_subscription_backlog_duration" indicates an abnormal increase in 
duration or unusual message backlogs, system administrators receive immediate 
alert notifications. These alerts enable administrators to quickly identify 
problems and take necessary measures to prevent message losses or further 
delays.
+
+In conclusion, the introduction of the "pulsar_subscription_backlog_duration" 
monitoring item enables effective monitoring of subscription health, real-time 
issue detection, and prevention of message delivery delays and losses. 
Additionally, timely alerting mechanisms empower proactive responses, ensuring 
the reliability and efficiency of the messaging system. This is essential for 
providing high-quality message delivery services, ensuring user experiences, 
and maintaining data integrity.
+
+# Goals
+
+## In Scope
+
+* SubscriptionStatsImpl add this stat
+* Metrics
+
+
+## Out of Scope
+
+* Implementing changes to the core functionality of the Pulsar messaging 
system itself.
+* Not include `NonPersistentTopic`.
+* Not include `DelayMessage`
+
+# High Level Design
+
+* add config `subscriptionBacklogDurationEnabled` in `broker.conf`
+  * note: because we need to read the markDelete position next position 
message, it will consume performance when the message is not in the cache, so 
add this flag
+* `SubscriptionStatsImpl` add `backlogDuration` variable
+* `AggregatedSubscriptionStats` add `backlogDuration` variable
+* add metric iterm named `pulsar_subscription_back_log_duration`
+
+# Detailed Design
+
+
+## Design & Implementation Details
+* when `PersistentSubscription` invoke getStats then reade the (`markDelete` + 
1) entry convert to `MessageMetadata` `publish_time` to represent the 
`earliestUnAckMessagePublishTime`
+* use currentTime - `earliestUnAckMessagePublishTime` represent 
`backlogDuration`
+* if markDelete haven't changed, don't need to get the new 
`earliestUnAckMessagePublishTime`, use directly, to reduce the read entry op
+
+## Public-facing Changes

Review Comment:
   I think it makes sense for what we expose as metrics to be consistent with 
the topic stats API.
   Today this API takes an "earliestMsgPublishTimeInBacklog" argument.
   How about, when it is true, we also add the age value you have for each 
subscription?
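
   As a minimal sketch of how that per-subscription age could be derived: the 
age is the current time minus the publish time of the earliest unacknowledged 
message, and the entry is re-read only when the mark-delete position moves. 
`BacklogAgeTracker`, its method, and the reduction of a position to a single 
`long` are assumptions made for illustration; none of them is an existing 
Pulsar API.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper for illustration only; not an existing Pulsar class. */
final class BacklogAgeTracker {
    // Assumption: a position is simplified to one long; real positions are (ledgerId, entryId).
    private long cachedPosition = Long.MIN_VALUE;
    // Publish time (epoch millis) of the earliest unacked message; negative means "no backlog".
    private long cachedPublishTimeMs = -1L;

    /**
     * Returns the age in milliseconds of the earliest unacknowledged message,
     * or 0 when the reader reports no backlog (a negative publish time).
     * The reader is only invoked when the mark-delete position has changed
     * or when no publish time is cached yet.
     */
    long ageMillis(long markDeletePosition, LongSupplier readEarliestPublishTimeMs, long nowMs) {
        if (markDeletePosition != cachedPosition || cachedPublishTimeMs < 0) {
            cachedPublishTimeMs = readEarliestPublishTimeMs.getAsLong();
            cachedPosition = markDeletePosition;
        }
        if (cachedPublishTimeMs < 0) {
            return 0L;
        }
        // Clamp to zero in case of clock skew between publisher and broker.
        return Math.max(0L, nowMs - cachedPublishTimeMs);
    }
}
```

   The same value could then back both the stats field and the metric, so the 
two stay consistent.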
   



##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" 
and "pulsar_subscription_back_log_no_delayed," provide valuable insights into 
the quantity of backlogged messages. However, they lack a metric that directly 
measures the duration of message backlog. Monitoring the duration of backlog is 
vital as it allows us to understand the persistence of message accumulation 
within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item 
"pulsar_subscription_backlog_duration" is to effectively monitor the health of 
subscriptions within the Pulsar messaging system. This health metric represents 
whether there are messages that have not been successfully acknowledged (ACKed) 
and potential consumer-side issues. Additionally, this monitoring item allows 
us to configure alerting mechanisms, ensuring timely notifications to users, 
thereby facilitating proactive response to potential issues.
+
+Maintaining the health of subscriptions is of paramount importance for the 
smooth operation of a messaging system. As message delivery involves 
interaction between producers and consumers, backlogs or unacknowledged 
messages can lead to data delays or losses. By monitoring 
"pulsar_subscription_backlog_duration," we can gain real-time insights into the 
duration of message backlogs and promptly detect any potential processing 
issues within subscriptions.
+
+The configuration and alerting settings for this monitoring item play a 
crucial role in responding swiftly to issues. When 
"pulsar_subscription_backlog_duration" indicates an abnormal increase in 
duration or unusual message backlogs, system administrators receive immediate 
alert notifications. These alerts enable administrators to quickly identify 
problems and take necessary measures to prevent message losses or further 
delays.
+
+In conclusion, the introduction of the "pulsar_subscription_backlog_duration" 
monitoring item enables effective monitoring of subscription health, real-time 
issue detection, and prevention of message delivery delays and losses. 
Additionally, timely alerting mechanisms empower proactive responses, ensuring 
the reliability and efficiency of the messaging system. This is essential for 
providing high-quality message delivery services, ensuring user experiences, 
and maintaining data integrity.
+
+# Goals
+
+## In Scope
+
+* SubscriptionStatsImpl add this stat

Review Comment:
   No.
   Quote again:
   
   > What this PIP intend to achieve once It's integrated into Pulsar.
   > Why does it benefit Pulsar.
   > 
   
   You took elements from the detailed design and laid them out as goals.
   
   Your goals, in light of the problems described in the motivation, are: ...
   
   



##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" 
and "pulsar_subscription_back_log_no_delayed," provide valuable insights into 
the quantity of backlogged messages. However, they lack a metric that directly 
measures the duration of message backlog. Monitoring the duration of backlog is 
vital as it allows us to understand the persistence of message accumulation 
within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item 
"pulsar_subscription_backlog_duration" is to effectively monitor the health of 
subscriptions within the Pulsar messaging system. This health metric represents 
whether there are messages that have not been successfully acknowledged (ACKed) 
and potential consumer-side issues. Additionally, this monitoring item allows 
us to configure alerting mechanisms, ensuring timely notifications to users, 
thereby facilitating proactive response to potential issues.
+
+Maintaining the health of subscriptions is of paramount importance for the 
smooth operation of a messaging system. As message delivery involves 
interaction between producers and consumers, backlogs or unacknowledged 
messages can lead to data delays or losses. By monitoring 
"pulsar_subscription_backlog_duration," we can gain real-time insights into the 
duration of message backlogs and promptly detect any potential processing 
issues within subscriptions.
+
+The configuration and alerting settings for this monitoring item play a 
crucial role in responding swiftly to issues. When 
"pulsar_subscription_backlog_duration" indicates an abnormal increase in 
duration or unusual message backlogs, system administrators receive immediate 
alert notifications. These alerts enable administrators to quickly identify 
problems and take necessary measures to prevent message losses or further 
delays.
+
+In conclusion, the introduction of the "pulsar_subscription_backlog_duration" 
monitoring item enables effective monitoring of subscription health, real-time 
issue detection, and prevention of message delivery delays and losses. 
Additionally, timely alerting mechanisms empower proactive responses, ensuring 
the reliability and efficiency of the messaging system. This is essential for 
providing high-quality message delivery services, ensuring user experiences, 
and maintaining data integrity.
+
+# Goals
+
+## In Scope
+
+* SubscriptionStatsImpl add this stat
+* Metrics
+
+
+## Out of Scope
+
+* Implementing changes to the core functionality of the Pulsar messaging 
system itself.
+* Not include `NonPersistentTopic`.
+* Not include `DelayMessage`
+
+# High Level Design
+
+* add config `subscriptionBacklogDurationEnabled` in `broker.conf`
+  * note: because we need to read the markDelete position next position 
message, it will consume performance when the message is not in the cache, so 
add this flag
+* `SubscriptionStatsImpl` add `backlogDuration` variable

Review Comment:
   No no. This is detailed design. Class names and variable names belong in 
the detailed design.
   
   Here, just describe the solution at a high level.
   



##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" 
and "pulsar_subscription_back_log_no_delayed," provide valuable insights into 
the quantity of backlogged messages. However, they lack a metric that directly 
measures the duration of message backlog. Monitoring the duration of backlog is 
vital as it allows us to understand the persistence of message accumulation 
within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item 
"pulsar_subscription_backlog_duration" is to effectively monitor the health of 
subscriptions within the Pulsar messaging system. This health metric represents 
whether there are messages that have not been successfully acknowledged (ACKed) 
and potential consumer-side issues. Additionally, this monitoring item allows 
us to configure alerting mechanisms, ensuring timely notifications to users, 
thereby facilitating proactive response to potential issues.
+
+Maintaining the health of subscriptions is of paramount importance for the 
smooth operation of a messaging system. As message delivery involves 
interaction between producers and consumers, backlogs or unacknowledged 
messages can lead to data delays or losses. By monitoring 
"pulsar_subscription_backlog_duration," we can gain real-time insights into the 
duration of message backlogs and promptly detect any potential processing 
issues within subscriptions.
+
+The configuration and alerting settings for this monitoring item play a 
crucial role in responding swiftly to issues. When 
"pulsar_subscription_backlog_duration" indicates an abnormal increase in 
duration or unusual message backlogs, system administrators receive immediate 
alert notifications. These alerts enable administrators to quickly identify 
problems and take necessary measures to prevent message losses or further 
delays.
+
+In conclusion, the introduction of the "pulsar_subscription_backlog_duration" 
monitoring item enables effective monitoring of subscription health, real-time 
issue detection, and prevention of message delivery delays and losses. 
Additionally, timely alerting mechanisms empower proactive responses, ensuring 
the reliability and efficiency of the messaging system. This is essential for 
providing high-quality message delivery services, ensuring user experiences, 
and maintaining data integrity.
+
+# Goals
+
+## In Scope
+
+* SubscriptionStatsImpl add this stat
+* Metrics
+
+
+## Out of Scope
+
+* Implementing changes to the core functionality of the Pulsar messaging 
system itself.
+* Not include `NonPersistentTopic`.
+* Not include `DelayMessage`
+
+# High Level Design
+
+* add config `subscriptionBacklogDurationEnabled` in `broker.conf`
+  * note: because we need to read the markDelete position next position 
message, it will consume performance when the message is not in the cache, so 
add this flag
+* `SubscriptionStatsImpl` add `backlogDuration` variable
+* `AggregatedSubscriptionStats` add `backlogDuration` variable
+* add metric iterm named `pulsar_subscription_back_log_duration`
+
+# Detailed Design
+
+
+## Design & Implementation Details
+* when `PersistentSubscription` invoke getStats then reade the (`markDelete` + 
1) entry convert to `MessageMetadata` `publish_time` to represent the 
`earliestUnAckMessagePublishTime`
+* use currentTime - `earliestUnAckMessagePublishTime` represent 
`backlogDuration`
+* if markDelete haven't changed, don't need to get the new 
`earliestUnAckMessagePublishTime`, use directly, to reduce the read entry op
+
+## Public-facing Changes
+
+
+### Configuration
+
+* add config `subscriptionBacklogDurationEnabled` in `broker.conf`

Review Comment:
   Before naming a configuration, you must think of the user experience. 
   First, search for existing metrics-related configuration names and try to 
be consistent with them.
   If you do, you will see many are of the form "expose*InPrometheus".



##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" 
and "pulsar_subscription_back_log_no_delayed," provide valuable insights into 
the quantity of backlogged messages. However, they lack a metric that directly 
measures the duration of message backlog. Monitoring the duration of backlog is 
vital as it allows us to understand the persistence of message accumulation 
within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item 
"pulsar_subscription_backlog_duration" is to effectively monitor the health of 
subscriptions within the Pulsar messaging system. This health metric represents 
whether there are messages that have not been successfully acknowledged (ACKed) 
and potential consumer-side issues. Additionally, this monitoring item allows 
us to configure alerting mechanisms, ensuring timely notifications to users, 
thereby facilitating proactive response to potential issues.

Review Comment:
   Here you expose the subscription backlog age.
   We have a similar concept for backlog size: a subscription has a backlog 
size, yet we also expose backlog size at the topic level. Maybe we should have 
a metric showing the max() of this age across subscriptions, e.g. 
`pulsar_topic_max_subscription_age`? Just an idea.
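
   A minimal sketch of that topic-level aggregation, assuming the 
per-subscription ages are already available as a map from subscription name to 
age in milliseconds. `TopicBacklogAge` and `maxSubscriptionAgeMillis` are 
hypothetical names for illustration, not existing Pulsar code.

```java
import java.util.Map;

/** Hypothetical helper for illustration only; not an existing Pulsar class. */
final class TopicBacklogAge {
    private TopicBacklogAge() {
    }

    /**
     * Returns the largest per-subscription backlog age in milliseconds,
     * or 0 when the topic has no subscriptions.
     */
    static long maxSubscriptionAgeMillis(Map<String, Long> ageMillisBySubscription) {
        long max = 0L;
        for (long age : ageMillisBySubscription.values()) {
            max = Math.max(max, age);
        }
        return max;
    }
}
```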



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
