Sounds very useful!

-Lari

On 2025/09/08 16:24:10 PengHui Li wrote:
> Hi team,
> 
> I have drafted a proposal to enhance observability of disaster recovery
> scenarios by adding broker-level metrics for non-recoverable data skipping.
> 
> Currently, when Pulsar's autoSkipNonRecoverableData feature skips
> corrupted data to maintain topic availability, there is no visibility into
> when and how frequently this occurs. This creates operational blind spots
> where administrators cannot be alerted when data loss happens, have no
> audit trail for compliance requirements, and cannot distinguish between
> healthy systems and those silently losing data.
> 
> Without these metrics, operators cannot determine whether issues are
> systematic (entire ledgers lost) or localized (partial corruption scenarios).
> 
> Proposed Solution: This PIP proposes adding two new broker-level metrics to
> the BrokerOperabilityMetrics class:
> 
>   1. pulsar_broker_non_recoverable_ledgers_skipped_total:
>       A counter incremented in ManagedLedgerImpl.skipNonRecoverableLedger()
>       each time an entire ledger is skipped due to complete unrecoverability.
>   2. pulsar_broker_non_recoverable_entries_skipped_total:
>       A counter incremented in ManagedCursorImpl.skipNonRecoverableEntries()
>       by the number of entries skipped when only partial ledger corruption
>       occurs.
> 
> The broker-level approach avoids adding a high-cardinality burden to the
> metrics system that would occur with topic-level metrics in large clusters.
> Operators can use these broker-level metrics for alerting and monitoring
> trends, then leverage existing broker logs for detailed forensic analysis
> of specific affected topics.
> 
> The full proposal is available for review here:
> https://github.com/apache/pulsar/pull/24716
> 
> I welcome any feedback, questions, or suggestions you may have.
> 
> Thanks,
> Penghui
> 
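
For anyone skimming the thread, here is a rough sketch of how the two
counters described above could behave. It is only an illustration under
stated assumptions: the class SkipMetricsSketch and its methods are
hypothetical, it uses plain java.util.concurrent counters rather than the
broker's real metrics plumbing, and it is not the code from the PIP.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for the two proposed broker-level counters.
// Not the actual BrokerOperabilityMetrics implementation.
class SkipMetricsSketch {
    // One counter per metric, shared by all topics on the broker
    // (no per-topic labels, so cardinality stays constant).
    private final LongAdder ledgersSkipped = new LongAdder();
    private final LongAdder entriesSkipped = new LongAdder();

    // Assumed to be called once each time a whole ledger is skipped.
    void recordLedgerSkipped() {
        ledgersSkipped.increment();
    }

    // Assumed to be called with the number of entries skipped in one pass.
    void recordEntriesSkipped(long count) {
        if (count > 0) {
            entriesSkipped.add(count);
        }
    }

    long ledgersSkipped() {
        return ledgersSkipped.sum();
    }

    long entriesSkipped() {
        return entriesSkipped.sum();
    }

    // Renders both counters as "name value" lines, using the metric
    // names from the proposal.
    String render() {
        return "pulsar_broker_non_recoverable_ledgers_skipped_total "
                + ledgersSkipped.sum() + "\n"
                + "pulsar_broker_non_recoverable_entries_skipped_total "
                + entriesSkipped.sum() + "\n";
    }
}
```

Because both values only ever increase, an alert could fire on any increase
over a time window, with the broker logs then used to find the affected
topics.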
