amaechler opened a new pull request, #18715:
URL: https://github.com/apache/druid/pull/18715

   ### Description
   
   Fixes a bug where the SeekableStream supervisor autoscaler creates duplicate 
history entries every `minTriggerScaleActionFrequencyMillis` (default: 10 
minutes) during scale-down operations, polluting the database and preventing 
scale-down from completing.
   
   _Lots of help from Claude._
   
   #### Problem
   
   When the autoscaler scales down tasks, `clearAllocationInfo()` prematurely 
clears `pendingCompletionTaskGroups`, causing the supervisor to "forget" about 
tasks transitioning from READING to PUBLISHING state. On the next supervisor 
cycle, these tasks are rediscovered and re-added to 
`activelyReadingTaskGroups`, triggering another scale-down attempt and creating 
a duplicate history entry. This repeats every 
`minTriggerScaleActionFrequencyMillis` (default: 10 minutes). I saw hundreds of 
duplicate history entries, with entries created at exact 10-minute intervals.
   
   The root cause is that the autoscaler has a built-in safeguard (lines 
480-496) that skips scale actions when `pendingCompletionTaskGroups` is 
non-empty, but this check is ineffective because `clearAllocationInfo()` clears 
the map immediately after tasks are moved there.
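
The loop described above can be reproduced with a toy model. This is a minimal sketch, not the actual `SeekableStreamSupervisor` code: the class `ScaleDownLoopModel`, the methods `autoscalerTick()`, `supervisorCycle()` and `historyEntriesAfter()`, and the `history` list are hypothetical stand-ins. Only the two map names and `clearAllocationInfo()` are taken from the description, and the rediscovery rule is a deliberate simplification.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only; names echo the PR description but this is not Druid code.
class ScaleDownLoopModel {
    final Map<Integer, String> activelyReadingTaskGroups = new HashMap<>();
    final Map<Integer, String> pendingCompletionTaskGroups = new HashMap<>();
    final List<String> history = new ArrayList<>(); // stand-in for supervisor history rows

    ScaleDownLoopModel() {
        activelyReadingTaskGroups.put(0, "task-0");
    }

    // One autoscaler evaluation (one tick of minTriggerScaleActionFrequencyMillis).
    void autoscalerTick() {
        // Safeguard: skip the scale action while tasks are pending completion.
        if (!pendingCompletionTaskGroups.isEmpty()) {
            return;
        }
        // Scale down: reading tasks move to pending completion, a history entry is written.
        pendingCompletionTaskGroups.putAll(activelyReadingTaskGroups);
        history.add("scale-down");
        clearAllocationInfo();
    }

    // Behaviour described as buggy: both maps are cleared.
    void clearAllocationInfo() {
        activelyReadingTaskGroups.clear();
        pendingCompletionTaskGroups.clear(); // the supervisor "forgets" the publishing task
    }

    // Assumed simplification: a task tracked in neither map is rediscovered as reading.
    void supervisorCycle() {
        if (pendingCompletionTaskGroups.isEmpty() && activelyReadingTaskGroups.isEmpty()) {
            activelyReadingTaskGroups.put(0, "task-0");
        }
    }

    static int historyEntriesAfter(int ticks) {
        ScaleDownLoopModel model = new ScaleDownLoopModel();
        for (int i = 0; i < ticks; i++) {
            model.autoscalerTick();
            model.supervisorCycle();
        }
        return model.history.size();
    }

    public static void main(String[] args) {
        // Every tick writes a new entry because the safeguard never sees a non-empty map.
        System.out.println(historyEntriesAfter(3)); // prints 3
    }
}
```

In this model the safeguard in `autoscalerTick()` never fires, because the map it inspects is always empty by the time the next tick runs.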
   
   #### Solution
   
   Preserve `pendingCompletionTaskGroups` in `clearAllocationInfo()`. This 
allows the autoscaler's existing skip logic to function correctly, preventing 
duplicate scale attempts until the tasks complete naturally. Completed tasks 
are removed by `checkPendingCompletionTasks()`, which runs every supervisor 
cycle.
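
As a rough illustration of the intended behaviour, the sketch below models a `clearAllocationInfo()` that leaves `pendingCompletionTaskGroups` alone. It is a simplified model under stated assumptions, not the change made in this PR: `PreservingSupervisorModel`, `tryScaleDown()`, `markPublished()` and the `historyEntries` counter are hypothetical, and the real maps hold task group objects rather than booleans.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified model; only the map names and the two supervisor method names
// (clearAllocationInfo, checkPendingCompletionTasks) come from the PR description.
class PreservingSupervisorModel {
    final Set<Integer> activelyReadingTaskGroups = new HashSet<>();
    // groupId -> whether the group has finished publishing (assumed representation)
    final Map<Integer, Boolean> pendingCompletionTaskGroups = new HashMap<>();
    int historyEntries = 0;

    // Returns true if a scale action ran, false if the skip logic blocked it.
    boolean tryScaleDown() {
        if (!pendingCompletionTaskGroups.isEmpty()) {
            return false; // existing skip logic: tasks are still publishing
        }
        for (Integer groupId : activelyReadingTaskGroups) {
            pendingCompletionTaskGroups.put(groupId, false);
        }
        historyEntries++;
        clearAllocationInfo();
        return true;
    }

    void clearAllocationInfo() {
        activelyReadingTaskGroups.clear();
        // pendingCompletionTaskGroups is intentionally left untouched.
    }

    // Hypothetical helper: records that a group's tasks finished publishing.
    void markPublished(int groupId) {
        pendingCompletionTaskGroups.replace(groupId, true);
    }

    // Runs every supervisor cycle and drops groups that have completed.
    void checkPendingCompletionTasks() {
        pendingCompletionTaskGroups.values().removeIf(done -> done);
    }
}
```

With the map preserved, a second `tryScaleDown()` returns `false` and writes no history entry until `markPublished()` followed by `checkPendingCompletionTasks()` has emptied the map.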
   
   #### Release note
   
   Fixed a bug in the SeekableStream supervisor autoscaler where scale-down 
operations would create duplicate supervisor history entries. The autoscaler 
now correctly waits for tasks to complete before attempting subsequent scale 
operations.
   
   <hr>
   
   ##### Key changed/added classes in this PR
   
   * `SeekableStreamSupervisor` - Modified `clearAllocationInfo()` to preserve 
`pendingCompletionTaskGroups`
   
   <hr>
   
   This PR has:
   
   * [x] been self-reviewed.
   * [ ] added documentation for new or modified features or behaviors.
   * [x] a release note entry in the PR description.
   * [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   * [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   * [x] added comments explaining the "why" and the intent of the code 
wherever it would not be obvious to an unfamiliar reader.
   * [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   * [ ] added integration tests.
   * [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

