gianm commented on PR #19091: URL: https://github.com/apache/druid/pull/19091#issuecomment-4020047307
> * `CostBasedAutoScaler` scales down tasks if processing rate is zero, even if lag is high. (I guess this shouldn't happen in practice since if lag is high, tasks must be busy with some processing. Except when there is a bug and task threads are stuck somewhere.) What'd you have it do in this situation? I suppose if processing rate is zero but lag is high, we shouldn't scale at all (either up or down). The combination of those two metrics suggests something is broken that scaling isn't going to fix. > * `SeekableStreamSupervisorIOConfig` did not honor the correct priority order of `taskCountStart` > `taskCount` > `taskCountMin` Huh, I thought we already changed that in #18745. Could you please also update the comment above the code block that currently says: `// if autoscaler is enabled, then taskCount will be ignored`. Please also update `docs/ingestion/supervisor.md`. It says some things that are no longer accurate, such as saying in the docs for `taskCountMin` and `taskCountStart` that "When you enable the autoscaler, Druid ignores the value of taskCount". > I want to merge the current patch and verify if the `isAnotherTaskGroupPublishing()` method is effective in identifying if there are prior pending publish actions. If it works as expected, I think we can build upon it further to get the queue semantics on the supervisor. > > @gianm , @jtuglu1 , what are your thoughts? I'm ok with going ahead with this patch given that more work is coming in this area, and that hopefully the `retryable` flag is going to prevent us from retrying things that will never succeed. I also believe that it would be better to move away from retries completely. The supervisor should be able to do more to help tasks sequence their publishes. That would also fix the long-standing issue that can happen with short-lived tasks: if task A is publishing, and its replacement task B runs for a short period of time and then starts publishing, it can finish before A and fail with a mismatched offsets error. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
