gianm commented on PR #19091:
URL: https://github.com/apache/druid/pull/19091#issuecomment-4020047307

   > * `CostBasedAutoScaler` scales down tasks if processing rate is zero, even 
if lag is high. (I guess this shouldn't happen in practice since if lag is 
high, tasks must be busy with some processing. Except when there is a bug and 
task threads are stuck somewhere.)
   
   What'd you have it do in this situation? I suppose if processing rate is 
zero but lag is high, we shouldn't scale at all (either up or down). The 
combination of those two metrics suggests something is broken that scaling 
isn't going to fix.
   
   > * `SeekableStreamSupervisorIOConfig` did not honor the correct priority 
order of `taskCountStart` > `taskCount` > `taskCountMin`
   
   Huh, I thought we already changed that in #18745. Could you please also 
update the comment above the code block that currently says: `// if autoscaler 
is enabled, then taskCount will be ignored`.
   
   Please also update `docs/ingestion/supervisor.md`. It says some things that 
are no longer accurate, such as saying in the docs for `taskCountMin` and 
`taskCountStart` that "When you enable the autoscaler, Druid ignores the value 
of taskCount".
   
   > I want to merge the current patch and verify if the 
`isAnotherTaskGroupPublishing()` method is effective in identifying if there 
are prior pending publish actions. If it works as expected, I think we can 
build upon it further to get the queue semantics on the supervisor.
   > 
   > @gianm , @jtuglu1 , what are your thoughts?
   
   I'm ok with going ahead with this patch given that more work is coming in 
this area, and that hopefully the `retryable` flag is going to prevent us from 
retrying things that will never succeed. I also believe that it would be better 
to move away from retries completely. The supervisor should be able to do more 
to help tasks sequence their publishes. That would also fix the long-standing 
issue that can happen with short-lived tasks: if task A is publishing, and its 
replacement task B runs for a short period of time and then starts publishing, 
it can finish before A and fail with a mismatched offsets error.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to