amaechler opened a new pull request, #18715: URL: https://github.com/apache/druid/pull/18715
### Description

Fixes a bug where the SeekableStream supervisor autoscaler creates duplicate history entries every `minTriggerScaleActionFrequencyMillis` (default: 10 minutes) during scale-down operations, causing database pollution and preventing scale-down from completing.

_Lots of help from Claude._

#### Problem

When the autoscaler scales down tasks, `clearAllocationInfo()` prematurely clears `pendingCompletionTaskGroups`, causing the supervisor to "forget" about tasks transitioning from the READING state to the PUBLISHING state. On the next supervisor cycle, these tasks are rediscovered and re-added to `activelyReadingTaskGroups`, triggering another scale-down attempt and creating a duplicate history entry. This repeats every `minTriggerScaleActionFrequencyMillis` (default: 10 minutes). I saw hundreds of duplicate history entries, created at exact 10-minute intervals.

The root cause is that the autoscaler has a built-in safeguard (lines 480-496) to skip scale actions when `pendingCompletionTaskGroups` is non-empty, but this check is ineffective because `clearAllocationInfo()` clears the map immediately after tasks are moved there.

#### Solution

Preserve `pendingCompletionTaskGroups` in `clearAllocationInfo()`. This allows the autoscaler's existing skip logic to function correctly, preventing duplicate scale attempts until tasks naturally complete (they are removed by `checkPendingCompletionTasks()` every supervisor cycle).

#### Release note

Fixed a bug in the SeekableStream supervisor autoscaler where scale-down operations would create duplicate supervisor history entries. The autoscaler now correctly waits for tasks to complete before attempting subsequent scale operations.

<hr>

##### Key changed/added classes in this PR

* `SeekableStreamSupervisor` - Modified `clearAllocationInfo()` to preserve `pendingCompletionTaskGroups`

<hr>

This PR has:

* [x] been self-reviewed.
* [ ] added documentation for new or modified features or behaviors.
* [x] a release note entry in the PR description.
* [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
* [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
* [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
* [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
* [ ] added integration tests.
* [ ] been tested in a test Druid cluster.
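
For readers who want to see the failure loop in isolation, here is a minimal toy model of the behavior described under Problem and Solution. It is not Druid's `SeekableStreamSupervisor` code: the class `ScaleDownSketch`, its helper methods, the task names, and the target task count are all hypothetical, and it assumes that `clearAllocationInfo()` also clears `activelyReadingTaskGroups` and that the scaled-down tasks stay in the publishing state for every simulated autoscaler trigger.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: field names echo the PR description, but nothing here is
// Druid's actual implementation.
class ScaleDownSketch
{
  private final Map<Integer, String> activelyReadingTaskGroups = new HashMap<>();
  private final Map<Integer, String> pendingCompletionTaskGroups = new HashMap<>();
  private final List<String> history = new ArrayList<>();
  private final boolean preservePending;

  ScaleDownSketch(boolean preservePending)
  {
    this.preservePending = preservePending;
    activelyReadingTaskGroups.put(0, "task-0");
    activelyReadingTaskGroups.put(1, "task-1");
  }

  // Stand-in for clearAllocationInfo(): the buggy variant also forgets the
  // task groups that are still publishing.
  private void clearAllocationInfo()
  {
    activelyReadingTaskGroups.clear();
    if (!preservePending) {
      pendingCompletionTaskGroups.clear();
    }
  }

  // Stand-in for one autoscaler trigger.
  private void maybeScaleDown(int targetTaskCount)
  {
    if (!pendingCompletionTaskGroups.isEmpty()) {
      return; // safeguard: skip while tasks are still completing
    }
    if (activelyReadingTaskGroups.size() > targetTaskCount) {
      // Tasks move from reading to publishing, then allocation info is reset.
      pendingCompletionTaskGroups.putAll(activelyReadingTaskGroups);
      clearAllocationInfo();
      history.add("scale down to " + targetTaskCount);
    }
  }

  // Stand-in for task discovery: running tasks the supervisor no longer
  // tracks are re-added as actively reading.
  private void discoverTasks(Map<Integer, String> stillPublishing)
  {
    for (Map.Entry<Integer, String> e : stillPublishing.entrySet()) {
      if (!pendingCompletionTaskGroups.containsKey(e.getKey())) {
        activelyReadingTaskGroups.put(e.getKey(), e.getValue());
      }
    }
  }

  // Runs the given number of autoscaler triggers and counts history entries.
  static int countHistoryEntries(boolean preservePending, int triggers)
  {
    ScaleDownSketch s = new ScaleDownSketch(preservePending);
    // Assumption: both tasks keep publishing for the whole simulation.
    Map<Integer, String> stillPublishing = new HashMap<>(s.activelyReadingTaskGroups);
    for (int i = 0; i < triggers; i++) {
      s.maybeScaleDown(1);
      s.discoverTasks(stillPublishing);
    }
    return s.history.size();
  }

  public static void main(String[] args)
  {
    System.out.println("clearing pending:   " + countHistoryEntries(false, 5));
    System.out.println("preserving pending: " + countHistoryEntries(true, 5));
  }
}
```

Under these assumptions, five triggers produce five history entries when the pending map is cleared and a single entry when it is preserved, which mirrors the one-entry-per-interval pattern reported above.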
