Hi team,

I have drafted a proposal to improve the observability of disaster recovery scenarios by adding broker-level metrics for non-recoverable data skipping.
Currently, when Pulsar's autoSkipNonRecoverableData feature skips corrupted data to keep a topic available, there is no visibility into when or how often this happens. This creates operational blind spots: administrators cannot be alerted when data loss occurs, have no audit trail for compliance requirements, and cannot distinguish a healthy system from one that is silently losing data. Without these metrics, operators also cannot tell whether the problem is systematic (entire ledgers lost) or localized (partial corruption within a ledger).

Proposed Solution:

This PIP proposes adding two new broker-level metrics to the BrokerOperabilityMetrics class:

1. pulsar_broker_non_recoverable_ledgers_skipped_total: A counter incremented in ManagedLedgerImpl.skipNonRecoverableLedger() each time an entire ledger is skipped because it is completely unrecoverable.

2. pulsar_broker_non_recoverable_entries_skipped_total: A counter incremented in ManagedCursorImpl.skipNonRecoverableEntries() by the number of entries skipped when only part of a ledger is corrupted.

The broker-level approach avoids the high-cardinality burden that topic-level metrics would place on the metrics system in large clusters. Operators can use these broker-level metrics for alerting and trend monitoring, then turn to the existing broker logs for detailed forensic analysis of the specific topics affected.

The full proposal is available for review here: https://github.com/apache/pulsar/pull/24716

I welcome any feedback, questions, or suggestions you may have.

Thanks,
Penghui