jineshparakh opened a new pull request, #17741:
URL: https://github.com/apache/pinot/pull/17741
## Background
Helix task queues in Pinot can grow unboundedly over time. While Helix
provides time-based expiry for **COMPLETED** jobs (default 24h), it has no
built-in expiry for **FAILED/TIMED_OUT** jobs (`terminalStateExpiry` defaults
to `-1`, meaning disabled), and **ABORTED** jobs are never cleaned up at all.
In long-running clusters with high task throughput, this causes thousands of
stale terminal jobs to accumulate in ZooKeeper under
`/{cluster}/CONFIGS/RESOURCE`.
This ZK bloat degrades Helix controller performance: `WorkflowConfig` and
`WorkflowContext` reads become increasingly expensive.
## Changes
This PR introduces a multi-layered defense against task queue bloat:
### 1. Count-based queue trimming *(new)*
Configurable size-based cleanup that runs during `prepTaskQueue`, before new
tasks are generated. When the queue exceeds a configurable limit
(`controller.task.queue.maxSize`, default `-1`/disabled), the oldest terminal
jobs (`COMPLETED`, `FAILED`, `TIMED_OUT`, `ABORTED`) are force-deleted up to a
configurable cap per cycle (`controller.task.queue.maxDeletesPerCycle`, default
`100`, hard cap `500`).
The implementation uses a bounded max-heap ordered by Helix-recorded job
start times (`WorkflowContext.getJobStartTime`) to efficiently select the N
oldest terminal jobs without sorting the entire queue. Jobs with unknown start
times (`-1`/`NOT_STARTED`) are treated as newest to prevent accidental deletion.
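The selection idea can be sketched as follows. This is a minimal, self-contained illustration of the bounded max-heap approach, not the PR's actual code; the class and method names (`OldestJobSelector`, `selectOldest`) are invented for the example, and the input is simplified to a map of job name to start time:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class OldestJobSelector {
  // Jobs with an unknown start time (-1 / NOT_STARTED) sort as newest,
  // so they are the last candidates for deletion.
  private static long effectiveStartTime(long startTime) {
    return startTime < 0 ? Long.MAX_VALUE : startTime;
  }

  /** Returns up to n job names with the oldest start times, oldest first. */
  public static List<String> selectOldest(Map<String, Long> jobStartTimes, int n) {
    // Max-heap bounded at size n: the root is the *newest* job kept so far,
    // so it is evicted whenever an older candidate arrives. This selects the
    // n oldest jobs in O(J log n) without sorting the entire queue.
    PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(
        Comparator.comparingLong(
            (Map.Entry<String, Long> e) -> effectiveStartTime(e.getValue())).reversed());
    for (Map.Entry<String, Long> entry : jobStartTimes.entrySet()) {
      heap.offer(entry);
      if (heap.size() > n) {
        heap.poll(); // drop the newest, keeping the n oldest
      }
    }
    List<String> oldest = new ArrayList<>(heap.size());
    while (!heap.isEmpty()) {
      oldest.add(heap.poll().getKey());
    }
    Collections.reverse(oldest); // the heap pops newest-first; flip to oldest-first
    return oldest;
  }
}
```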
**Design note:** The deletion loop runs inside the existing `synchronized`
block. Since this is a proactive fix to prevent gradual ZK/Helix performance
degradation, a brief hold time (bounded by `maxDeletesPerCycle`, default `100`)
during a background scheduling cycle is an acceptable trade-off over the
alternative of unbounded queue growth causing systemic slowdowns.
### 2. Terminal state expiry for FAILED/TIMED_OUT jobs *(new)*
Configures Helix's `JobConfig.terminalStateExpiry` so that `FAILED` and
`TIMED_OUT` jobs are eventually purged by Helix's own cleanup mechanism.
Default: 72 hours (`controller.task.terminalStateExpireTimeMs`).
### 3. Configurable task expiry *(enhancement)*
The existing `COMPLETED` job expiry (`controller.task.expire.time.ms`,
default `24h`) was already configurable at controller startup but required a
restart to change. This PR makes it hot-reloadable via cluster config changes.
### 4. Helix queue capacity *(new)*
Sets `WorkflowConfig.capacity` on newly created task queues as a hard
ceiling (`controller.task.queue.capacity`, default `-1`/unlimited). When the
queue reaches capacity during enqueue, Helix purges expired jobs first; if
still full, enqueue fails with `HelixException`. This acts as a safety net
complementing the soft count-based trimming.
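The purge-then-fail semantics can be modeled with a small toy sketch. This illustrates the documented behavior only and is not Helix's implementation; `CapacityModel`, its `Job` record, and the use of `IllegalStateException` as a stand-in for `HelixException` are all invented for the example:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of enqueue-at-capacity behavior: purge expired jobs first, then fail. */
public class CapacityModel {
  static final class Job {
    final String name;
    final boolean expired; // stand-in for "terminal and past its expiry time"
    Job(String name, boolean expired) { this.name = name; this.expired = expired; }
  }

  private final int capacity; // -1 means unlimited
  private final Deque<Job> queue = new ArrayDeque<>();

  CapacityModel(int capacity) { this.capacity = capacity; }

  void enqueue(Job job) {
    if (capacity >= 0 && queue.size() >= capacity) {
      queue.removeIf(j -> j.expired);  // step 1: purge expired jobs
      if (queue.size() >= capacity) {
        // step 2: still full -> reject (Helix raises HelixException here)
        throw new IllegalStateException("Queue is full");
      }
    }
    queue.addLast(job);
  }

  int size() { return queue.size(); }
}
```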
### 5. Queue size warning threshold *(new)*
Logs a warning when queue size exceeds a configurable threshold
(`controller.task.queue.warningThreshold`, default `5000`). This warning fires
independently of whether count-based trimming is enabled, giving operators
early visibility into queue growth.
### 6. Task queue size metric *(new)*
Emits a `TASK_QUEUE_SIZE` gauge per task type for monitoring and alerting.
The gauge resets to `0` if a task type disappears, preventing stale metric
values.
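The reset-to-zero behavior can be sketched with a plain map standing in for the metrics registry; this is an illustration of the semantics, not Pinot's metrics code, and `TaskQueueGauges` is an invented name:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the stale-gauge reset: task types that disappear report 0, not their last value. */
public class TaskQueueGauges {
  private final Map<String, Long> gauges = new HashMap<>();

  /** Called once per cycle with the current queue size per task type. */
  public void update(Map<String, Long> currentSizes) {
    // Reset gauges for task types no longer present so dashboards don't show stale values.
    for (String taskType : gauges.keySet()) {
      if (!currentSizes.containsKey(taskType)) {
        gauges.put(taskType, 0L); // value-only update, safe while iterating keySet
      }
    }
    gauges.putAll(currentSizes);
  }

  public long get(String taskType) {
    return gauges.getOrDefault(taskType, 0L);
  }
}
```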
### 7. Dynamic configuration
All new and modified settings are hot-reloadable via cluster config changes
(`PinotClusterConfigChangeListener` / `onChange`) without requiring a
controller restart. Input validation is applied:
- `taskExpireTimeMs` and `terminalStateExpireTimeMs` reject non-positive
values
- `maxDeletesPerCycle` rejects non-positive values and clamps values above
`500`
- `capacity` accepts only `-1` (unlimited) or positive integers
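The validation rules above can be sketched as pure functions; this is an illustrative reading of the rules, not the PR's code (names are invented, and "reject" is modeled here as keeping the previous value unchanged):

```java
/** Sketch of the cluster-config input validation (illustrative names). */
public class TaskQueueConfigValidator {
  static final int MAX_DELETES_HARD_CAP = 500;

  /** Expiry settings reject non-positive values: keep the previous value on bad input. */
  static long validateExpiry(long candidateMs, long previousMs) {
    return candidateMs > 0 ? candidateMs : previousMs;
  }

  /** Rejects non-positive values and clamps anything above the hard cap of 500. */
  static int validateMaxDeletesPerCycle(int candidate, int previous) {
    if (candidate <= 0) {
      return previous;
    }
    return Math.min(candidate, MAX_DELETES_HARD_CAP);
  }

  /** Accepts only -1 (unlimited) or positive integers. */
  static int validateCapacity(int candidate, int previous) {
    return (candidate == -1 || candidate > 0) ? candidate : previous;
  }
}
```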
*NOTE:* All of the newly introduced changes are disabled by default.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]