jineshparakh opened a new pull request, #17741:
URL: https://github.com/apache/pinot/pull/17741
## Background
Helix task queues in Pinot can grow unboundedly over time. While Helix
provides time-based expiry for **COMPLETED** jobs (default 24h), it has no
built-in expiry for **FAILED/TIMED_OUT** jobs (`terminalStateExpiry` defaults
to `-1`, meaning disabled), and **ABORTED** jobs are never cleaned up at all.
In long-running clusters with high task throughput, this causes thousands of
stale terminal jobs to accumulate in ZooKeeper under
`/{cluster}/CONFIGS/RESOURCE`.
This ZK bloat degrades Helix controller performance: `WorkflowConfig` and
`WorkflowContext` reads become increasingly expensive.
## Changes
This PR introduces a multi-layered defense against task queue bloat:
### 1. Count-based queue trimming *(new)*
Configurable size-based cleanup that runs during `prepTaskQueue`, before new
tasks are generated. When the queue exceeds a configurable limit
(`controller.task.queue.maxSize`, default `-1`/disabled), the oldest terminal
jobs (`COMPLETED`, `FAILED`, `TIMED_OUT`, `ABORTED`) are force-deleted up to a
configurable cap per cycle (`controller.task.queue.maxDeletesPerCycle`, default
`100`, hard cap `500`).
The implementation uses a bounded max-heap ordered by Helix-recorded job
start times (`WorkflowContext.getJobStartTime`) to efficiently select the N
oldest terminal jobs without sorting the entire queue. Jobs with unknown start
times (`-1`/`NOT_STARTED`) are treated as newest to prevent accidental deletion.
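The selection idea can be sketched as follows. This is a minimal, self-contained illustration of the bounded max-heap approach, not the PR's actual code; the class and method names (`OldestJobSelector`, `selectOldest`) are invented for the example, and the input is simplified to a map of job name to start time:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class OldestJobSelector {
  // Jobs with an unknown start time (-1 / NOT_STARTED) sort as newest,
  // so they are the last candidates for deletion.
  private static long effectiveStartTime(long startTime) {
    return startTime < 0 ? Long.MAX_VALUE : startTime;
  }

  /** Returns up to n job names with the oldest start times, oldest first. */
  public static List<String> selectOldest(Map<String, Long> jobStartTimes, int n) {
    // Max-heap bounded at size n: the root is the *newest* job kept so far,
    // so it is evicted whenever an older candidate arrives. This selects the
    // n oldest jobs in O(J log n) without sorting the entire queue.
    PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(
        Comparator.comparingLong(
            (Map.Entry<String, Long> e) -> effectiveStartTime(e.getValue())).reversed());
    for (Map.Entry<String, Long> entry : jobStartTimes.entrySet()) {
      heap.offer(entry);
      if (heap.size() > n) {
        heap.poll(); // drop the newest, keeping the n oldest
      }
    }
    List<String> oldest = new ArrayList<>(heap.size());
    while (!heap.isEmpty()) {
      oldest.add(heap.poll().getKey());
    }
    Collections.reverse(oldest); // the heap pops newest-first; flip to oldest-first
    return oldest;
  }
}
```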
**Design note:** The deletion loop runs inside the existing `synchronized`
block. Since this is a proactive fix to prevent gradual ZK/Helix performance
degradation, a brief hold time (bounded by `maxDeletesPerCycle`, default `100`)
during a background scheduling cycle is an acceptable trade-off over the
alternative of unbounded queue growth causing systemic slowdowns.
### 2. Terminal state expiry for FAILED/TIMED_OUT jobs *(new)*
Configures Helix's `JobConfig.terminalStateExpiry` so that `FAILED` and
`TIMED_OUT` jobs are eventually purged by Helix's own cleanup mechanism.
Default: 72 hours (`controller.task.terminalStateExpireTimeMs`).
### 3. Configurable task expiry *(enhancement)*
The existing `COMPLETED` job expiry (`controller.task.expire.time.ms`,
default `24h`) was already configurable at controller startup but required a
restart to change. This PR makes it hot-reloadable via cluster config changes.
### 4. Helix queue capacity *(new)*
Sets `WorkflowConfig.capacity` on newly created task queues as a hard
ceiling (`controller.task.queue.capacity`, default `-1`/unlimited). When the
queue reaches capacity during enqueue, Helix purges expired jobs first; if
still full, enqueue fails with `HelixException`. This acts as a safety net
complementing the soft count-based trimming.
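The purge-then-fail semantics can be modeled with a small toy sketch. This illustrates the documented behavior only and is not Helix's implementation; `CapacityModel`, its `Job` record, and the use of `IllegalStateException` as a stand-in for `HelixException` are all invented for the example:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of enqueue-at-capacity behavior: purge expired jobs first, then fail. */
public class CapacityModel {
  static final class Job {
    final String name;
    final boolean expired; // stand-in for "terminal and past its expiry time"
    Job(String name, boolean expired) { this.name = name; this.expired = expired; }
  }

  private final int capacity; // -1 means unlimited
  private final Deque<Job> queue = new ArrayDeque<>();

  CapacityModel(int capacity) { this.capacity = capacity; }

  void enqueue(Job job) {
    if (capacity >= 0 && queue.size() >= capacity) {
      queue.removeIf(j -> j.expired);  // step 1: purge expired jobs
      if (queue.size() >= capacity) {
        // step 2: still full -> reject (Helix raises HelixException here)
        throw new IllegalStateException("Queue is full");
      }
    }
    queue.addLast(job);
  }

  int size() { return queue.size(); }
}
```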
### 5. Queue size warning threshold *(new)*
Logs a warning when queue size exceeds a configurable threshold
(`controller.task.queue.warningThreshold`, default `5000`). This warning fires
independently of whether count-based trimming is enabled, giving operators
early visibility into queue growth.
### 6. Task queue size metric *(new)*
Emits a `TASK_QUEUE_SIZE` gauge per task type for monitoring and alerting.
The gauge resets to `0` if a task type disappears, preventing stale metric
values.
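The reset-to-zero behavior can be sketched with a plain map standing in for the metrics registry; this is an illustration of the semantics, not Pinot's metrics code, and `TaskQueueGauges` is an invented name:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the stale-gauge reset: task types that disappear report 0, not their last value. */
public class TaskQueueGauges {
  private final Map<String, Long> gauges = new HashMap<>();

  /** Called once per cycle with the current queue size per task type. */
  public void update(Map<String, Long> currentSizes) {
    // Reset gauges for task types no longer present so dashboards don't show stale values.
    for (String taskType : gauges.keySet()) {
      if (!currentSizes.containsKey(taskType)) {
        gauges.put(taskType, 0L); // value-only update, safe while iterating keySet
      }
    }
    gauges.putAll(currentSizes);
  }

  public long get(String taskType) {
    return gauges.getOrDefault(taskType, 0L);
  }
}
```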
### 7. Dynamic configuration
All new and modified settings are hot-reloadable via cluster config changes
(`PinotClusterConfigChangeListener` / `onChange`) without requiring a
controller restart. Input validation is applied:
- `taskExpireTimeMs` and `terminalStateExpireTimeMs` reject non-positive
values
- `maxDeletesPerCycle` rejects non-positive values and clamps values above
`500`
- `capacity` accepts only `-1` (unlimited) or positive integers
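The validation rules above can be sketched as pure functions; this is an illustrative reading of the rules, not the PR's code (names are invented, and "reject" is modeled here as keeping the previous value unchanged):

```java
/** Sketch of the cluster-config input validation (illustrative names). */
public class TaskQueueConfigValidator {
  static final int MAX_DELETES_HARD_CAP = 500;

  /** Expiry settings reject non-positive values: keep the previous value on bad input. */
  static long validateExpiry(long candidateMs, long previousMs) {
    return candidateMs > 0 ? candidateMs : previousMs;
  }

  /** Rejects non-positive values and clamps anything above the hard cap of 500. */
  static int validateMaxDeletesPerCycle(int candidate, int previous) {
    if (candidate <= 0) {
      return previous;
    }
    return Math.min(candidate, MAX_DELETES_HARD_CAP);
  }

  /** Accepts only -1 (unlimited) or positive integers. */
  static int validateCapacity(int candidate, int previous) {
    return (candidate == -1 || candidate > 0) ? candidate : previous;
  }
}
```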
*NOTE:* All of the newly introduced changes are disabled by default.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]