Sounds very useful! -Lari
On 2025/09/08 16:24:10 PengHui Li wrote:
> Hi team,
>
> I have drafted a proposal to enhance observability of disaster recovery
> scenarios by adding broker-level metrics for non-recoverable data skipping.
>
> Currently, when Pulsar's autoSkipNonRecoverableData feature skips
> corrupted data to maintain topic availability, there is no visibility into
> when and how frequently this occurs. This creates operational blind spots
> where administrators cannot be alerted when data loss happens, have no
> audit trail for compliance requirements, and cannot distinguish between
> healthy systems and those silently losing data.
>
> Without these metrics, operators cannot determine whether issues are
> systematic (entire ledgers lost) or localized (partial corruption
> scenarios).
>
> Proposed Solution: This PIP proposes adding two new broker-level metrics to
> the BrokerOperabilityMetrics class:
>
> 1. pulsar_broker_non_recoverable_ledgers_skipped_total:
>    A counter incremented in ManagedLedgerImpl.skipNonRecoverableLedger()
>    each time an entire ledger is skipped due to complete
>    unrecoverability.
> 2. pulsar_broker_non_recoverable_entries_skipped_total:
>    A counter incremented in ManagedCursorImpl.skipNonRecoverableEntries()
>    by the number of entries skipped when only partial ledger corruption
>    occurs.
>
> The broker-level approach avoids adding a high-cardinality burden to the
> metrics system that would occur with topic-level metrics in large clusters.
> Operators can use these broker-level metrics for alerting and monitoring
> trends, then leverage existing broker logs for detailed forensic analysis
> of specific affected topics.
>
> The full proposal is available for review here:
> https://github.com/apache/pulsar/pull/24716
>
> I welcome any feedback, questions, or suggestions you may have.
>
> Thanks,
> Penghui
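
For illustration, the two counters described in the quoted proposal could be
sketched roughly as below. This is a minimal standalone sketch and not the code
from the PR: the class SkipMetricsSketch and all of its methods are
hypothetical, and it does not touch BrokerOperabilityMetrics or any other
Pulsar class. Only the two metric names are taken from the proposal. The ledger
counter goes up by one per skipped ledger, while the entry counter goes up by
the number of entries skipped.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for the two broker-level counters; not Pulsar code. */
final class SkipMetricsSketch {
    // LongAdder keeps increments cheap when many threads update concurrently.
    private final LongAdder ledgersSkipped = new LongAdder();
    private final LongAdder entriesSkipped = new LongAdder();

    /** Assumed to be called once each time an entire ledger is skipped. */
    void recordLedgerSkipped() {
        ledgersSkipped.increment();
    }

    /** Assumed to be called with the number of entries skipped in one pass. */
    void recordEntriesSkipped(long count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be >= 0");
        }
        entriesSkipped.add(count);
    }

    long ledgersSkipped() {
        return ledgersSkipped.sum();
    }

    long entriesSkipped() {
        return entriesSkipped.sum();
    }

    /** Renders both counters as "name value" lines, one metric per line. */
    String toPrometheusText() {
        return "pulsar_broker_non_recoverable_ledgers_skipped_total "
                + ledgersSkipped() + "\n"
                + "pulsar_broker_non_recoverable_entries_skipped_total "
                + entriesSkipped() + "\n";
    }

    public static void main(String[] args) {
        SkipMetricsSketch metrics = new SkipMetricsSketch();
        metrics.recordLedgerSkipped();     // one whole ledger lost
        metrics.recordEntriesSkipped(42);  // partial corruption in another ledger
        System.out.print(metrics.toPrometheusText());
    }
}
```

Because both counters are per broker rather than per topic, the number of time
series stays constant regardless of topic count, which is the cardinality
argument made in the proposal.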
