Hi team,

I have drafted a proposal to enhance observability of disaster recovery
scenarios by adding broker-level metrics for non-recoverable data skipping.

Currently, when Pulsar's autoSkipNonRecoverableData feature skips
corrupted data to maintain topic availability, there is no visibility into
when and how frequently this occurs. This creates operational blind spots
where administrators cannot be alerted when data loss happens, have no
audit trail for compliance requirements, and cannot distinguish between
healthy systems and those silently losing data.

Without these metrics, operators cannot determine whether issues are
systematic (entire ledgers lost) or localized (partial corruption scenarios).

Proposed Solution: This PIP proposes adding two new broker-level metrics to
the BrokerOperabilityMetrics class:

  1. pulsar_broker_non_recoverable_ledgers_skipped_total:
      A counter incremented in ManagedLedgerImpl.skipNonRecoverableLedger()
      each time an entire ledger is skipped due to complete unrecoverability.
  2. pulsar_broker_non_recoverable_entries_skipped_total:
      A counter incremented in ManagedCursorImpl.skipNonRecoverableEntries()
      by the number of entries skipped when only partial ledger corruption
      occurs.
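
To make the counting semantics of the two metrics concrete, here is a minimal,
self-contained sketch. It is not the code from the PR: the class name
SkippedDataCounters and all of its method names are hypothetical, and the real
change would live in BrokerOperabilityMetrics and be wired into the broker's
existing metrics pipeline. The sketch only shows that the ledger counter
advances by one per skipped ledger, while the entry counter advances by the
number of entries skipped.

```java
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical stand-in for the two proposed broker-level counters.
 * Assumption: this is an illustration only, not Pulsar's actual implementation.
 */
final class SkippedDataCounters {
    // LongAdder keeps increments cheap when many threads record skips.
    private final LongAdder ledgersSkipped = new LongAdder();
    private final LongAdder entriesSkipped = new LongAdder();

    /** Would be called once per ledger skipped as fully unrecoverable. */
    void recordLedgerSkipped() {
        ledgersSkipped.increment();
    }

    /** Would be called with the number of entries skipped in a partially corrupted ledger. */
    void recordEntriesSkipped(long count) {
        if (count > 0) {
            entriesSkipped.add(count);
        }
    }

    long ledgersSkipped() {
        return ledgersSkipped.sum();
    }

    long entriesSkipped() {
        return entriesSkipped.sum();
    }

    /** Renders both counters as "name value" lines, one metric per line. */
    String toMetricsText() {
        return "pulsar_broker_non_recoverable_ledgers_skipped_total " + ledgersSkipped() + "\n"
             + "pulsar_broker_non_recoverable_entries_skipped_total " + entriesSkipped() + "\n";
    }
}
```

Because both values are monotonically increasing counters, an alert would
normally be based on an increase over a time window rather than on the
absolute value.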

The broker-level approach avoids adding a high-cardinality burden to the
metrics system that would occur with topic-level metrics in large clusters.
Operators can use these broker-level metrics for alerting and monitoring
trends, then leverage existing broker logs for detailed forensic analysis of
specific affected topics.

The full proposal is available for review here:
https://github.com/apache/pulsar/pull/24716

I welcome any feedback, questions, or suggestions you may have.

Thanks,
Penghui
