onceMisery commented on code in PR #25299:
URL: https://github.com/apache/pulsar/pull/25299#discussion_r2905645414
##########
pip/pip-459.md:
##########
@@ -0,0 +1,329 @@
+# PIP-459: Batch Status Summary and Filtered Listing for Pulsar Functions
+
+# Background knowledge
+
+Pulsar Functions are managed by the Functions worker and exposed through the
Pulsar Admin API and `pulsar-admin` CLI. Today, listing functions in a
namespace returns only function names. To understand runtime health, operators
must fetch function status for each function separately. That creates an N+1
request pattern: one request to list function names and one additional request
per function to fetch status. In practice, this is slow, noisy, and hard to use
in scripts or daily operations.
+
+Function runtime status is already represented by `FunctionStatus`, which
contains aggregate fields such as the total configured instances and the number
of running instances. This proposal introduces a lightweight namespace-level
summary model built from those counts. The summary is intentionally smaller
than the full status payload: it is designed for listing and filtering, not for
per-instance inspection.
+
+The proposal also has to account for mixed-version deployments. A new client
may talk to an older worker that does not expose the new summary endpoint yet.
In that case, the client must degrade safely instead of failing outright. That
compatibility requirement affects both the public admin interface and the
client-side implementation strategy.
+
+# Motivation
+
+`pulsar-admin functions list` currently cannot answer a basic operational
question: which functions in this namespace are actually running. Operators
must either inspect each function one by one or write shell loops that issue
many admin calls. The problem becomes more visible in namespaces containing
dozens or hundreds of functions, where a simple health check becomes expensive
and slow.
+
+The initial feature request was to add state-aware listing to the CLI, but the
implementation uncovered three broader issues that should be addressed together:
+
+1. The current user experience requires N+1 admin calls for namespace-level
health inspection.
+2. A new summary endpoint must not break new clients when they talk to older
workers during rolling upgrades.
+3. Namespace-level summary generation can become slow if the worker builds
results strictly serially.
+
+This PIP proposes a batch status summary API for Pulsar Functions and
integrates it into the admin client, CLI, and worker implementation in a
backward-compatible way.
+
+# Goals
+
+## In Scope
+
+- Add a namespace-level batch status summary API for Pulsar Functions.
+- Add a lightweight public data model that returns function name, derived
state, instance counts, and failure classification.
+- Add admin client support for the new summary API, including fallback to
legacy workers that do not implement it.
+- Add CLI support for long-format listing and state-based filtering using the
batch summary API.
+- Add pagination support for namespace-level function summaries.
+- Improve worker-side summary generation latency by using controlled
parallelism.
+- Add a worker configuration knob to cap summary-query parallelism.
+- Add a worker metric to observe summary-query execution time.
+
+## Out of Scope
+
+- Changing the existing `functions list` endpoint that returns only function
names.
+- Returning full per-instance function status from the new namespace-level
endpoint.
+- Adding equivalent summary endpoints for sources or sinks in this PIP.
+- Adding server-side filtering by state in the REST endpoint.
+- Reworking the underlying function runtime status model or scheduler behavior.
+
+
+# High Level Design
+
+This proposal adds a new REST endpoint:
+
+`GET /admin/v3/functions/{tenant}/{namespace}/status/summary`
+
+The endpoint returns a list of `FunctionStatusSummary` objects. Each object
contains:
+
+- `name`
+- `state`: `RUNNING`, `STOPPED`, `PARTIAL`, `UNKNOWN`
+- `numInstances`
+- `numRunning`
+- `error`
+- `errorType`
+
+The server remains a generic summary provider. It does not apply state
filtering. The CLI consumes the summary list and performs presentation concerns
locally, such as `--state` filtering and `--long` formatting. This separation
keeps the endpoint reusable for other clients and prevents coupling the REST
contract to one CLI presentation format.
+
+For compatibility, the admin client first tries the new summary endpoint. If
the server responds with `404 Not Found` or `405 Method Not Allowed`, the
client falls back to the legacy flow: fetch the function names, apply
name-based pagination, and then query each function status individually to
build summaries client-side. This allows a new client to work against older
workers during mixed-version upgrades.
+
+On the worker side, summary generation is executed only for the requested page
and uses a bounded thread pool. A new worker configuration,
`functionsStatusSummaryMaxParallelism`, limits how many function status lookups
may run concurrently for a single summary request.
+
+# Detailed Design
+
+## Design & Implementation Details
+
+### Data model
+
+A new public model, `FunctionStatusSummary`, is added under the admin API data
package. It intentionally returns only aggregate listing information:
+
+```java
+public class FunctionStatusSummary {
+ public enum SummaryState {
+ RUNNING,
+ STOPPED,
+ PARTIAL,
+ UNKNOWN
+ }
+
+ public enum ErrorType {
+ AUTHENTICATION_FAILED,
+ FUNCTION_NOT_FOUND,
+ NETWORK_ERROR,
+ INTERNAL_ERROR
+ }
+
+ private String name;
+ private SummaryState state;
+ private int numInstances;
+ private int numRunning;
+ private String error;
+ private ErrorType errorType;
+}
+```
+
+`state` is derived from aggregate instance counts:
+
+- `RUNNING`: `numRunning == numInstances` and `numInstances > 0`
+- `STOPPED`: `numRunning == 0` and `numInstances > 0`
+- `PARTIAL`: `0 < numRunning < numInstances`
+- `UNKNOWN`: the status query failed or the instance counts are not meaningful
+
+### Admin API interface compatibility
+
+The public `Functions` admin interface gains namespace-level summary methods:
+
+- `getFunctionsWithStatus(String tenant, String namespace)`
+- `getFunctionsWithStatusAsync(String tenant, String namespace)`
+- `getFunctionsWithStatus(String tenant, String namespace, Integer limit,
String continuationToken)`
+- `getFunctionsWithStatusAsync(String tenant, String namespace, Integer limit,
String continuationToken)`
+
+These methods are introduced as `default` methods. This is important because
`Functions` is a public interface and can be implemented outside the Pulsar
repository. Adding abstract methods would break source or binary compatibility
for custom implementations. Using `default` methods preserves compatibility
while still exposing the new capability.
+
+The default implementation also provides a compatibility fallback path by
using the legacy list-plus-status flow if a server-side implementation is
unavailable.
+
+### REST endpoint
+
+The worker exposes a new endpoint:
+
+`GET /admin/v3/functions/{tenant}/{namespace}/status/summary`
+
+The endpoint accepts two optional query parameters:
+
+- `limit`: maximum number of functions to return; must be greater than `0`
when present
+- `continuationToken`: exclusive cursor based on function name in
lexicographical order
+
+The server obtains the namespace function list, sorts names lexicographically,
applies pagination, and then builds one `FunctionStatusSummary` per function in
the requested page.
+
+Per-function status retrieval failures are isolated. The endpoint should still
return `200 OK` with a summary entry whose `state` is `UNKNOWN` and whose
`error` and `errorType` explain the failure. This prevents one unhealthy
function from failing the whole namespace-level listing request.
+
+### Worker implementation
+
+The worker implementation builds summaries using controlled parallelism:
+
+1. Validate worker availability and request parameters.
+2. Reuse namespace authorization and list validation logic.
+3. Fetch and sort function names.
+4. Apply `continuationToken` and `limit`.
+5. Create a bounded executor sized by `min(pageSize,
functionsStatusSummaryMaxParallelism)`.
+6. Query status for functions in the requested page concurrently.
+7. Convert each result into `FunctionStatusSummary`.
+8. Record request latency in worker metrics.
+
+The worker uses two status retrieval paths:
+
+- Fast path: query local worker status logic directly.
+- Fallback path: if the local path fails with a recoverable server-side error,
use the worker's internal admin client.
+
+This fallback is useful for redirect and topology edge cases and keeps worker
behavior more robust.
+
+### Client fallback to legacy workers
+
+New clients may talk to workers that do not implement `/status/summary`. When
the admin client receives `404` or `405` from the new endpoint, it falls back
to:
+
+1. `getFunctions(tenant, namespace)`
+2. lexicographical sorting
+3. pagination using `limit` and `continuationToken`
+4. `getFunctionStatus(...)` for each function in the selected page
+
+This preserves functionality during rolling upgrades and mixed-version
environments. The fallback returns the same `FunctionStatusSummary` data model,
so callers do not need special handling.
+
+### CLI behavior
+
+The proposal extends `pulsar-admin functions list` with:
+
+- `--state`
+- `-l, --long`
+- `--limit`
+- `--continuation-token`
+
+Behavior:
+
+- With no new flags, the command keeps the existing behavior and prints only
function names.
+- With `--long`, the command prints summary rows that include state and
running/instance counts.
+- With `--state`, the CLI filters the summary list locally.
+- With `--limit` and `--continuation-token`, the CLI paginates through the
summary endpoint.
+
+The CLI rejects combining `--state` with `--limit` or `--continuation-token`.
+
+That restriction is intentional. Pagination is applied on the server to the
full ordered function list, while state filtering happens in the CLI after the
page is returned. Allowing both together would make users think they are
getting "a page of RUNNING functions" when they are actually getting "a page of
all functions, then a local state filter." Rejecting the combination keeps
command semantics honest.
+
+## Public-facing Changes
+
+### Public API
+
+New admin REST endpoint:
+
+- Path: `GET /admin/v3/functions/{tenant}/{namespace}/status/summary`
+- Query parameters:
+ - `limit` optional integer, must be greater than `0`
+ - `continuationToken` optional string, exclusive function-name cursor
+- Response body: JSON array of `FunctionStatusSummary`
+
+Response body fields for each element:
+
+- `name`: function name
+- `state`: `RUNNING`, `STOPPED`, `PARTIAL`, or `UNKNOWN`
+- `numInstances`: configured instance count
+- `numRunning`: currently running instance count
+- `error`: human-readable error message when the summary could not retrieve
function status
+- `errorType`: categorized failure reason when `state` is `UNKNOWN`
+
+Response codes:
+
+- `200 OK`: request succeeded; individual function failures are represented in
response elements rather than failing the whole request
+- `400 Bad Request`: invalid tenant, namespace, or `limit`
+- `403 Forbidden`: requester lacks permission to list functions in the
namespace
+- `503 Service Unavailable`: Functions worker service is unavailable
+
+Public admin client additions:
+
+- namespace-level `getFunctionsWithStatus(...)` methods, including
asynchronous and paginated variants
+- these methods are additive and implemented as `default` methods to preserve
compatibility for external implementations
+
+### Binary protocol
+
+No Pulsar binary protocol changes are introduced by this proposal.
+
+### Configuration
+
+New worker configuration:
+
+```yaml
+# Maximum parallelism used by function status-summary batch queries.
+functionsStatusSummaryMaxParallelism: 4
+```
+
+This configuration limits the number of concurrent status lookups used to
build a summary page. A lower value reduces pressure on worker and admin
dependencies. A higher value can reduce response latency for larger pages.
+
+### CLI
+
+New CLI options for `pulsar-admin functions list`:
+
+- `--state`
+- `-l`, `--long`
+- `--limit`
+- `--continuation-token`
+
+Example:
+
+```bash
+pulsar-admin functions list \
+ --tenant public \
+ --namespace default \
+ --state RUNNING \
+ --long
+```
+
+Example paginated call:
+
+```bash
+pulsar-admin functions list \
+ --tenant public \
+ --namespace default \
+ --limit 20 \
+ --continuation-token my-function-020
+```
+
+### Metrics
+
+- Full name: `pulsar_function_worker_functions_status_summary_query_time_ms`
+- Description: execution time of a namespace-level functions status-summary
batch query
+- Attributes: the standard worker metric labels configured by
`WorkerStatsManager`
+- Unit: milliseconds
+
+
+# Monitoring
+
+Operators should monitor the new summary query latency metric to detect
namespaces or environments where namespace-level function inspection is
becoming expensive. A practical first alert is sustained high p95 or max
latency for `pulsar_function_worker_functions_status_summary_query_time_ms`,
especially after enabling large page sizes or running many functions per
namespace.
+
+Operators should also treat growth in `UNKNOWN` summaries as a signal that
status retrieval is degraded. Since per-function failures are isolated into the
response instead of failing the whole request, dashboards or scripts that
consume the summary endpoint should count `state = UNKNOWN` and inspect
`errorType` to distinguish authentication problems, missing functions, network
failures, and internal errors.
+
+# Security Considerations
+
+The new endpoint uses the same namespace-level authorization model as existing
function listing operations. A caller must have permission to list functions in
the target namespace. The endpoint does not expose function data across tenant
or namespace boundaries and continues to rely on existing request validation
and authorization checks.
+
+Per-function failures are summarized as `UNKNOWN` entries with an error
classification instead of returning stack traces. This keeps the API useful
operationally while avoiding unnecessary disclosure of internal implementation
details. As with other admin APIs, operators should review whether error
messages returned to unprivileged callers are appropriate for their
deployment's security posture.
+
+Because the endpoint supports pagination through a continuation token derived
from function names, the implementation must continue to validate tenant and
namespace inputs before performing any listing or status lookup.
+
+# Backward & Forward Compatibility
+
+## Upgrade
+
+This proposal is additive. Existing calls to `getFunctions(...)` and
`pulsar-admin functions list` without new flags continue to behave as before.
+
+During rolling upgrades:
+
+- new clients can talk to old workers because the admin client falls back when
`/status/summary` returns `404` or `405`
+- new workers can serve both old and new clients
+- external `Functions` interface implementations remain compatible because the
new methods are `default` methods
+
+No special upgrade sequencing is required beyond standard rolling upgrade
practices.
+
+## Downgrade / Rollback
+
+If the cluster is downgraded to a version without `/status/summary`, new
clients will continue to function through the fallback path, although summary
requests will be slower because they revert to list-plus-status behavior.
+
+If the CLI or client is also downgraded, the new flags and endpoint are simply
unavailable and users return to the legacy per-function workflow.
+
+The worker configuration `functionsStatusSummaryMaxParallelism` should be
removed or ignored when running an older worker version that does not recognize
it.
+
+## Pulsar Geo-Replication Upgrade & Downgrade/Rollback Considerations
+
+This proposal affects the Functions admin plane only. It does not change topic
replication protocols, metadata formats for geo-replication, or data-plane
message compatibility. No geo-replication-specific upgrade or rollback steps
are required.
+
+# Alternatives
+
+Client-side only aggregation was considered. It would require changing only
the CLI, but it preserves the N+1 request pattern and provides no reusable API
for other clients.
+
+Server-side filtering by state was also considered. It was rejected because
state filtering is a presentation concern and would make the endpoint less
reusable. Returning the raw summary set keeps the API generic and allows
clients to evolve their own filters without server-side API growth.
+
+Allowing `--state` together with pagination was considered, but rejected
because filtering after pagination produces misleading semantics. The current
proposal explicitly blocks that combination.
+
+Another alternative was overloading the existing list endpoint to optionally
return enriched status objects. That was rejected because one endpoint
returning different response types based on query flags complicates client
typing, documentation, and compatibility.
+
+# General Notes
+
+This PIP intentionally solves the namespace-level operator workflow first. The
same pattern may later be extended to sources and sinks, but that work is kept
separate to avoid broadening the review scope.
+
+The worker-side implementation still computes summaries by querying
per-function status and then aggregating results. Controlled parallelism
improves latency, but it does not fundamentally change the cost model. Future
work may explore more direct or cached aggregation paths if very large
namespaces make the endpoint hot.
+
+# Links
+
Review Comment:
Thanks for raising this concern!
You're absolutely right that the Admin API is stateless and requests can
hit
different workers. My design addresses this through deterministic,
name-based
cursor pagination:
**How it works:**
- continuationToken is simply the last function name from the previous page
- Each worker independently sorts all function names lexicographically
- The token acts as an exclusive lower bound to find the next page
- No server-side state is required; the token is self-contained
This approach works consistently across workers because the sorting is
deterministic and metadata is replicated.
**Trade-off:**
If functions are created/deleted during pagination, clients may see
duplicates
or miss entries. This is a common limitation of stateless cursor-based
pagination
in distributed systems (similar to Kubernetes list pagination or DynamoDB
queries).
**One improvement I should make:**
I also agree the API should return nextContinuationToken explicitly rather
than only a bare list, so I can update the response shape to make the paging
contract clear.
```json
{
"summaries": [...],
"nextContinuationToken": "func-xyz" // or null
}
```
Would this design work for you? I can update the PR accordingly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]