mattisonchao commented on code in PR #25513:
URL: https://github.com/apache/pulsar/pull/25513#discussion_r3218280882


##########
pip/pip-471.md:
##########
@@ -0,0 +1,302 @@
+# PIP-471: Authorization Operation Metrics
+
+# Background knowledge
+
+Pulsar brokers perform authorization checks before allowing clients, proxies, 
and administrative callers to access
+topics, namespaces, tenants, brokers, clusters, and policy operations. These 
checks are handled through the broker-side
+`AuthorizationService`, which delegates decisions to the configured 
`AuthorizationProvider`.
+
+Pulsar already exposes security-related metrics, especially around 
authentication. These metrics help operators detect
+login failures, unhealthy clients, and changes in access patterns. However, 
Pulsar does not expose a generic broker-level
+metric stream for authorization outcomes. Authorization denials are mostly 
visible through request failures and logs,
+which makes them harder to alert on and harder to compare with successful 
authorization traffic.
+
+Pulsar also supports both Prometheus-compatible metrics and OpenTelemetry 
metrics. New broker observability features
+should keep those pipelines aligned when possible, so operators can consume 
equivalent signals regardless of their
+metrics backend.
+
+# Motivation
+
+Operators need a low-cardinality, broker-native signal that shows whether 
authorization checks are succeeding or failing.
+This is useful for security alerting, baseline monitoring, and 
compliance-oriented reporting.
+
+Without a dedicated authorization metric, operators have to infer 
authorization denials from logs, HTTP status codes, or
+client-side errors. That is brittle and does not support standard monitoring 
patterns such as:
+
+- Alerting on spikes in authorization failures.
+- Comparing authorization failures against successful authorizations.
+- Distinguishing authentication failures from authorization failures.
+- Building dashboards by authorization resource category.
+- Exporting equivalent authorization signals through both Prometheus and 
OpenTelemetry.
+
+A failure-only metric is also not sufficient. Operators often need success, 
failure, and error counts together to
+understand whether a denial spike reflects an attack, a rollout issue, a 
policy mistake, an authorization provider
+problem, or a normal traffic shift.
+
+# Goals
+
+## In Scope
+
+- Add a low-cardinality broker authorization metric for operation outcomes.
+- Record successful, failed, and errored authorization operations.
+- Expose the metric through the Prometheus-compatible broker metrics endpoint.
+- Expose the same metric through OpenTelemetry.
+- Centralize instrumentation in `AuthorizationService` so broker authorization 
paths share the same metric model.
+- Avoid identity-bearing or high-cardinality metric dimensions.
+
+## Out of Scope
+
+- Per-role, per-topic, per-tenant, per-namespace, or per-principal labels.
+- Audit-log payloads or structured security event streams.
+- New authorization APIs or binary protocol changes.
+- Alert rule definitions for downstream monitoring stacks.
+- Configuration to enable or disable this specific metric.
+
+# High Level Design
+
+Introduce a generic authorization operation counter that is incremented when 
the broker finishes an authorization
+decision or rejects an authorization request before invoking the configured 
provider.
+
+The metric is recorded centrally in `AuthorizationService`, which is the 
broker-side entry point for authorization checks
+across topic, namespace, tenant, broker, cluster, and policy operations. Each 
completed provider decision or direct
+authorization rejection emits one result with a small, fixed dimension set:
+
+- the resource type that was checked
+- the operation that was requested
+- whether the result was a success, failure, or error
+
+This metric is exported in two equivalent forms by the same helper class:
+
+- a Prometheus counter for the existing broker metrics endpoint
+- an OpenTelemetry `LongCounter` for OpenTelemetry metrics export
+
+Invalid original-principal combinations in proxied authorization flows are 
counted as authorization failures because the
+broker rejects the request during authorization handling. For valid proxied 
authorization flows, the broker evaluates
+both the proxy role and the original principal, and each completed 
authorization decision is recorded.
+
+# Detailed Design
+
+## Design & Implementation Details
+
+This proposal introduces 
`org.apache.pulsar.broker.authorization.metrics.AuthorizationMetrics`, a broker 
authorization
+metrics helper that owns:
+
+- a Prometheus `Counter` for broker metrics scraping
+- an OpenTelemetry `LongCounter` for OpenTelemetry metrics export
+
+The helper uses the following constants for metric names, instrumentation 
scope, label values, and OpenTelemetry
+attribute keys:
+
+| Constant | Value |
+|---|---|
+| `AUTHORIZATION_OPERATIONS_METRIC_NAME` | 
`pulsar_authorization_operations_total` |
+| `AUTHORIZATION_COUNTER_METRIC_NAME` | `pulsar.authorization.operation.count` 
|
+| `INSTRUMENTATION_SCOPE_NAME` | `org.apache.pulsar.authorization` |
+| `RESULT_SUCCESS` | `success` |
+| `RESULT_FAILURE` | `failure` |
+| `RESULT_ERROR` | `error` |
+| `RESOURCE_TYPE_KEY` | `pulsar.authorization.resource.type` |
+| `OPERATION_KEY` | `pulsar.authorization.operation` |
+| `RESULT_KEY` | `pulsar.authorization.result` |
+
+`AuthorizationMetrics` registers a static Prometheus counter with labels 
`resource_type`, `operation`, and `result`.
+It also builds an OpenTelemetry `LongCounter` from the `OpenTelemetry` 
instance passed to the constructor.
+
+The helper exposes three recording methods:
+
+| Method | Behavior |
+|---|---|
+| `recordSuccess(resourceType, operation)` | Increments the Prometheus counter 
with `result="success"` and adds `1` to the OpenTelemetry counter with 
`pulsar.authorization.result="success"`. |
+| `recordFailure(resourceType, operation)` | Increments the Prometheus counter 
with `result="failure"` and adds `1` to the OpenTelemetry counter with 
`pulsar.authorization.result="failure"`. |
+| `recordError(resourceType, operation)` | Increments the Prometheus counter 
with `result="error"` and adds `1` to the OpenTelemetry counter with 
`pulsar.authorization.result="error"`. |
+
+`AuthorizationService` owns one `AuthorizationMetrics` instance. The existing 
`AuthorizationService` constructor remains
+available and delegates to a new constructor with `OpenTelemetry.noop()`. 
`BrokerService` constructs
+`AuthorizationService` with `pulsar.getOpenTelemetry().getOpenTelemetry()` so 
the OpenTelemetry counter is exported by
+the broker's OpenTelemetry pipeline.
+
+`AuthorizationService` records a result after each completed authorization 
operation. If the provider returns `true`, the
+helper records a success. If the provider returns `false`, the helper records 
a failure. If the provider future completes
+exceptionally, the helper records an error because authorization evaluation 
failed before a boolean decision was returned.
+
+If `AuthorizationService` rejects a request before provider evaluation, such 
as an invalid original-principal combination
+for proxied requests, it records a failure directly and returns a completed 
`false` future. Existing
+authorization-disabled short-circuit behavior is preserved; operation methods 
that already return early when
+authorization is disabled do not emit this metric on that path.
+
+The instrumentation applies to the following authorization flows:
+
+- superuser checks
+- tenant-admin checks
+- tenant operations
+- broker operations
+- cluster operations
+- cluster policy operations
+- namespace operations
+- namespace policy operations
+- topic operations
+- topic policy operations
+
+The metric dimensions are intentionally bounded. The resource type is selected 
from a fixed set of constants in
+`AuthorizationMetrics`. The operation is `check` for superuser and 
tenant-admin checks. For enum-backed operations, the
+operation is the lower-case enum name. If an existing authorization path does 
not provide an operation value, the metric
+uses a fixed `unknown` operation value rather than failing the request path or 
introducing dynamic labels.
+
+The metric does not include role names, topic names, tenant names, namespace 
names, client addresses, provider names,
+exception classes, or error messages.
+
+## Public-facing Changes
+
+### Public API
+
+No public client, admin, REST, or `AuthorizationProvider` API changes.
+
+### Binary protocol
+
+No binary protocol changes.
+
+### Configuration
+
+No new configuration is required.
+
+### CLI
+
+No CLI changes.
+
+### Metrics
+
+Prometheus metric:
+
+| Field | Value |
+|---|---|
+| Full name | `pulsar_authorization_operations_total` |
+| Description | Pulsar authorization operations |
+| Type | Counter |
+| Labels | `resource_type`, `operation`, `result` |
+| Unit | operations |
+
+OpenTelemetry metric:
+
+| Field | Value |
+|---|---|
+| Full name | `pulsar.authorization.operation.count` |
+| Description | The number of authorization operations |
+| Type | `LongCounter` |
+| Attributes | `pulsar.authorization.resource.type`, 
`pulsar.authorization.operation`, `pulsar.authorization.result` |
+| Unit | `{operation}` |
+
+Result values:
+
+| Value | Meaning |
+|---|---|
+| `success` | The authorization request was allowed. |
+| `failure` | The authorization request was denied or rejected by 
authorization handling. |

Review Comment:
   sure. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to