Denovo1998 commented on code in PR #24716:
URL: https://github.com/apache/pulsar/pull/24716#discussion_r2352615620


##########
pip/pip-441.md:
##########
@@ -0,0 +1,115 @@
+# PIP-441: Add Broker-Level Metrics for Skipped Non-Recoverable Data
+
+# Background knowledge
+
+Pulsar's `autoSkipNonRecoverableData` feature allows brokers to skip corrupted data during disaster recovery to maintain topic availability. The system uses two skip strategies:
+
+1. **Ledger-level skipping**: Skips entire ledgers when completely unrecoverable
+2. **Entry-level skipping**: Skips specific entries within a ledger when only partially corrupted
+
+Entry-level skipping was introduced in PIP-327 and refined in [PR #17753](https://github.com/apache/pulsar/pull/17753) to handle scenarios like ledger corruption, bookie failures, and partial data loss.
+
+# Motivation
+
+Currently, there is no visibility into when and how frequently non-recoverable data is being skipped, creating operational challenges:
+
+- **No alerting capability** when data loss occurs
+- **No audit trail** for compliance and data integrity requirements  
+- **Cannot distinguish** between healthy systems and those silently skipping data
+- **Cannot determine** if data loss is wholesale (ledgers) or partial (entries)
+- **Limited capacity planning** without understanding failure patterns
+
+# Goals
+
+## In Scope
+
+Add two broker-level metrics:
+- `pulsar_broker_non_recoverable_ledgers_skipped_total` - Count of ledgers skipped
+- `pulsar_broker_non_recoverable_entries_skipped_total` - Count of entries skipped
+
+## Out of Scope
+
+- Topic/subscription-level metrics (would burden metrics system with high cardinality)
+- Historical tracking of specific ledgers/entries skipped
+- Changes to existing `autoSkipNonRecoverableData` functionality
+
+# High Level Design
+
+Two broker-level counters will be added to `BrokerOperabilityMetrics`:
+
+- **Ledger counter**: Incremented in `ManagedLedgerImpl.skipNonRecoverableLedger()`
+- **Entry counter**: Incremented in `ManagedCursorImpl.skipNonRecoverableEntries()`
+
+Both metrics are exposed via the existing Prometheus `/metrics` endpoint.
+
+# Detailed Design
+
+## Implementation Details
+
+**BrokerOperabilityMetrics Changes:**
+```java
+private final LongAdder nonRecoverableLedgersSkippedCount;
+private final LongAdder nonRecoverableEntriesSkippedCount;
+
+public void recordNonRecoverableLedgerSkipped() {
+    this.nonRecoverableLedgersSkippedCount.increment();
+}
+
+public void recordNonRecoverableEntriesSkipped(long entriesCount) {
+    this.nonRecoverableEntriesSkippedCount.add(entriesCount);
+}
+```
+
+**Integration Points:**
+- `ManagedLedgerImpl.skipNonRecoverableLedger()` → calls `recordNonRecoverableLedgerSkipped()`
+- `ManagedCursorImpl.skipNonRecoverableEntries()` → calls `recordNonRecoverableEntriesSkipped(count)`
+
+**OpenTelemetry Support:**
+- `pulsar.broker.non_recoverable_ledger.skip.count` 
+- `pulsar.broker.non_recoverable_entries.skip.count`
+
+## Public-facing Changes
+
+### Metrics
+
+| Metric Name | Description | Type |
+|-------------|-------------|------|
+| `pulsar_broker_non_recoverable_ledgers_skipped_total` | Count of ledgers skipped when `autoSkipNonRecoverableData` is enabled | Counter |
+| `pulsar_broker_non_recoverable_entries_skipped_total` | Count of entries skipped when `autoSkipNonRecoverableData` is enabled | Counter |
+
+**Labels:** `broker`, `cluster`
+
+# Monitoring
+
+**Use Cases:**
+- **Alerting**: Get notified when data loss occurs 
+- **SLA Monitoring**: Track data durability metrics
+- **Root Cause Analysis**: Compare metrics to understand if issues are systematic (ledger-level) or localized (entry-level)
+- **Investigation**: Use metrics for alerting, then check broker logs for specific topic details
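
For readers of this thread, the two counters and their labels can be sketched in isolation as below. This is only an illustration under stated assumptions: `SkippedDataCounters` and `toPrometheusText` are hypothetical names, not part of `BrokerOperabilityMetrics` or of this PR, and the real broker renders metrics through its own Prometheus plumbing. Only the two `record...` method names, the metric names, and the `broker`/`cluster` labels are taken from the proposal.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for the two counters the proposal adds to
// BrokerOperabilityMetrics; not the actual Pulsar class.
final class SkippedDataCounters {
    // LongAdder keeps increments cheap when many threads record skips at once.
    private final LongAdder ledgersSkipped = new LongAdder();
    private final LongAdder entriesSkipped = new LongAdder();

    // One call per skipped ledger.
    void recordNonRecoverableLedgerSkipped() {
        ledgersSkipped.increment();
    }

    // One call per skip operation, carrying the number of entries skipped.
    void recordNonRecoverableEntriesSkipped(long entriesCount) {
        entriesSkipped.add(entriesCount);
    }

    long ledgersSkipped() {
        return ledgersSkipped.sum();
    }

    long entriesSkipped() {
        return entriesSkipped.sum();
    }

    // Assumed helper: renders both counters as Prometheus text-format samples
    // carrying the two labels named in the proposal.
    String toPrometheusText(String cluster, String broker) {
        String labels = "{cluster=\"" + cluster + "\",broker=\"" + broker + "\"}";
        return "pulsar_broker_non_recoverable_ledgers_skipped_total" + labels
                + " " + ledgersSkipped() + "\n"
                + "pulsar_broker_non_recoverable_entries_skipped_total" + labels
                + " " + entriesSkipped() + "\n";
    }
}
```

Because the labels are limited to `broker` and `cluster`, each broker emits exactly two series regardless of how many topics it serves, which is the cardinality trade-off the "Out of Scope" section describes.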

Review Comment:
   Broker logs can pinpoint which specific topic is having an issue.
   Is this log already written by the current code? If so, where is it emitted?
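
   To make the question concrete, the pairing I would expect looks roughly like the sketch below: the broker-level counter stays label-free, and the topic detail goes into a log line written at the same point. Everything here is an assumption for illustration: `SkipRecorder`, `onEntriesSkipped`, and the message format are hypothetical, it uses `java.util.logging` only to stay self-contained, and I have not verified what the existing code actually logs.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.logging.Logger;

// Hypothetical illustration only; not the actual ManagedCursorImpl code.
final class SkipRecorder {
    private static final Logger LOG = Logger.getLogger(SkipRecorder.class.getName());

    // Broker-wide total, deliberately without a topic dimension.
    private final LongAdder entriesSkipped = new LongAdder();

    // Assumed hook: called wherever entries are skipped. The counter gives the
    // alert signal; the log line carries the topic and ledger for follow-up.
    void onEntriesSkipped(String topic, long ledgerId, long entriesCount) {
        entriesSkipped.add(entriesCount);
        LOG.warning(String.format(
                "[%s] Skipped %d non-recoverable entries in ledger %d",
                topic, entriesCount, ledgerId));
    }

    long entriesSkipped() {
        return entriesSkipped.sum();
    }
}
```

   If no such log exists today, the "Investigation" use case would depend on adding one alongside the counter.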


