congbobo184 commented on code in PR #20859: URL: https://github.com/apache/pulsar/pull/20859#discussion_r1289518559
##########
pip/pip-285.md:
##########
@@ -0,0 +1,73 @@
+# Background knowledge
+Existing monitoring items in Pulsar, such as "pulsar_subscription_back_log" and "pulsar_subscription_back_log_no_delayed," provide valuable insights into the quantity of backlogged messages. However, they lack a metric that directly measures the duration of message backlog. Monitoring the duration of backlog is vital as it allows us to understand the persistence of message accumulation within a subscription over time.
+
+# Motivation
+
+The motivation behind introducing the new monitoring item "pulsar_subscription_backlog_duration" is to effectively monitor the health of subscriptions within the Pulsar messaging system. This health metric represents whether there are messages that have not been successfully acknowledged (ACKed) and potential consumer-side issues. Additionally, this monitoring item allows us to configure alerting mechanisms, ensuring timely notifications to users, thereby facilitating proactive response to potential issues.
+
+Maintaining the health of subscriptions is of paramount importance for the smooth operation of a messaging system. As message delivery involves interaction between producers and consumers, backlogs or unacknowledged messages can lead to data delays or losses. By monitoring "pulsar_subscription_backlog_duration," we can gain real-time insights into the duration of message backlogs and promptly detect any potential processing issues within subscriptions.
+
+The configuration and alerting settings for this monitoring item play a crucial role in responding swiftly to issues. When "pulsar_subscription_backlog_duration" indicates an abnormal increase in duration or unusual message backlogs, system administrators receive immediate alert notifications. These alerts enable administrators to quickly identify problems and take necessary measures to prevent message losses or further delays.
+
+In conclusion, the introduction of the "pulsar_subscription_backlog_duration" monitoring item enables effective monitoring of subscription health, real-time issue detection, and prevention of message delivery delays and losses. Additionally, timely alerting mechanisms empower proactive responses, ensuring the reliability and efficiency of the messaging system. This is essential for providing high-quality message delivery services, ensuring user experiences, and maintaining data integrity.
+
+# Goals
+
+## In Scope
+
+* SubscriptionStatsImpl add this stat
+* Metrics
+
+
+## Out of Scope
+
+* Implementing changes to the core functionality of the Pulsar messaging system itself.
+* Not include `NonPersistentTopic`.
+* Not include `DelayMessage`
+
+# High Level Design
+
+* add config `subscriptionBacklogDurationEnabled` in `broker.conf`
+  * note: because we need to read the markDelete position next position message, it will consume performance when the message is not in the cache, so add this flag
+* `SubscriptionStatsImpl` add `backlogDuration` variable
+* `AggregatedSubscriptionStats` add `backlogDuration` variable
+* add metric iterm named `pulsar_subscription_back_log_duration`
+
+# Detailed Design
+
+
+## Design & Implementation Details
+* when `PersistentSubscription` invoke getStats then reade the (`markDelete` + 1) entry convert to `MessageMetadata` `publish_time` to represent the `earliestUnAckMessagePublishTime`

Review Comment:
Sorry for the late response.
Pulsar has four configurations that control ledger rollover:

```
managedLedgerMaxEntriesPerLedger=50000
# Minimum time between ledger rollover for a topic
managedLedgerMinLedgerRolloverTimeMinutes=10
# Maximum time before forcing a ledger rollover for a topic
managedLedgerMaxLedgerRolloverTimeMinutes=240
# Maximum ledger size before triggering a rollover for a topic (MB)
managedLedgerMaxSizePerLedgerMbytes=2048
```

With the default settings, the time span covered by a single ledger can be expected to reach 240 minutes. How can we accurately calculate the latency in that case? I don't think lowering this value to reduce the error is a feasible solution, because it would put significant pressure on ZooKeeper.

When the subscription is in a "catch up read" state, the message at the mark-delete position will be present in the cache, so reading it incurs no I/O overhead. I/O overhead occurs only for a small number of subscriptions, and the current implementation does not perform redundant I/O when the mark-delete position has not changed.

Note also that the sending rate of each topic is not uniform and can vary as many kinds of events occur, which makes a time estimate inaccurate. For some businesses, the acceptable backlog duration is on the order of minutes. An inaccurate estimate can therefore produce a large number of false alarms, and frequent false alarms significantly hurt the user experience of Pulsar.

Maybe we can have two sets of implementations to support different scenarios.
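To put rough numbers on the false-alarm concern, here is a minimal sketch. It is not Pulsar code: the class `BacklogDurationError` and both methods are hypothetical, and it assumes an estimator that knows only the creation time of the ledger holding the first unacknowledged entry. It compares that estimate with the value obtained by reading the entry's publish time, using the default `managedLedgerMaxLedgerRolloverTimeMinutes=240`.

```java
import java.time.Duration;

// Hypothetical illustration only; none of these names exist in Pulsar.
class BacklogDurationError {
    // Default managedLedgerMaxLedgerRolloverTimeMinutes from broker.conf.
    static final Duration MAX_LEDGER_ROLLOVER = Duration.ofMinutes(240);

    // Exact backlog duration: now minus the publish time of the
    // first unacknowledged entry (the one after markDelete).
    static Duration exact(long nowMs, long earliestUnackedPublishMs) {
        return Duration.ofMillis(nowMs - earliestUnackedPublishMs);
    }

    // Assumed ledger-granularity estimate: now minus the creation
    // time of the ledger that contains that entry.
    static Duration estimateFromLedger(long nowMs, long ledgerCreatedMs) {
        return Duration.ofMillis(nowMs - ledgerCreatedMs);
    }

    public static void main(String[] args) {
        long minute = Duration.ofMinutes(1).toMillis();
        long ledgerCreatedMs = 0L;
        // Entry published one minute before the forced rollover.
        long publishMs = MAX_LEDGER_ROLLOVER.toMillis() - minute;
        // Stats are read one minute after the rollover.
        long nowMs = MAX_LEDGER_ROLLOVER.toMillis() + minute;

        Duration exact = exact(nowMs, publishMs);                       // 2 minutes
        Duration estimate = estimateFromLedger(nowMs, ledgerCreatedMs); // 241 minutes
        Duration error = estimate.minus(exact);                         // 239 minutes

        System.out.println("exact=" + exact.toMinutes() + "m"
                + " estimate=" + estimate.toMinutes() + "m"
                + " error=" + error.toMinutes() + "m");
    }
}
```

Under these assumptions, an alert threshold of a few minutes would fire on the estimate even though the real backlog is only two minutes old.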
