himanshug commented on pull request #10524: URL: https://github.com/apache/druid/pull/10524#issuecomment-760533912
Going through the code, I notice that overall the patch does the following:

(1) periodically updates lag statistics in the background
(2) periodically triggers the "autoscaling algorithm" to see if tasks need to be scaled up/down
(3) if the "autoscaling algorithm" decides that a scale up/down is needed, updates the taskCount in the spec in the DB
(4) then "triggers" the scale up/down

In the current design, everything is embedded inside `SeekableStreamSupervisor`. However, it looks like (1), (2) and (3) could happen independently and outside of `SeekableStreamSupervisor`, which would avoid putting more stuff in there. The only change potentially needed inside `SeekableStreamSupervisor` is a way to trigger task creation/deletion based on the updated taskCount in the spec.

The idea can be taken a little further. You have used a specific "autoscaling algorithm" based on heuristics provided by the user. I can easily imagine other versions of the "autoscaling algorithm" with more intelligence built in, relying on less and less information from the user. I can even see the need to support multiple "autoscaling algorithms" that users can pick and choose from, so that different algorithms can be tried out without impacting existing users. In fact, as a far-off thought, users might want their own autoscaler algorithms based on stats specific to their deployment instead of Kafka topic lag, e.g. checking the load metrics of their pipeline, which could trigger autoscaling well before things start lagging.

One quick line of thought I had while reading the code was that we could have something like:

```
// add a new method to the SupervisorSpec interface
interface SupervisorSpec
{
  ...
  SupervisorTaskAutoscaler createAutoscaler();
  ...
}

// start() sets up whatever scheduler is needed to collect stats and run its own autoscaling
// algorithm logic, then, if needed, updates taskCount in the spec and signals the supervisor
// to reconcile based on the new taskCount
interface SupervisorTaskAutoscaler
{
  void start();

  void stop();
}

// Update SupervisorManager to also manage the autoscaler wherever needed, i.e. introduce code like
SupervisorTaskAutoscaler autoscaler = spec.createAutoscaler();
if (autoscaler != null) {
  autoscaler.start(); // or autoscaler.stop()
}

// Now there can be multiple implementations of SupervisorTaskAutoscaler that
// do things their own way; we could even make it extensible
```

It would be great if it is feasible to rearrange this PR into something like the above. I am sure I have missed some important detail, but I wanted to note my ideas here to find out what you and others think.
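The following is a minimal sketch of how one `SupervisorTaskAutoscaler` implementation might keep steps (1) to (4) outside `SeekableStreamSupervisor`. Everything in it is hypothetical and not an existing Druid API. That includes `LagBasedAutoscaler`, its thresholds, and the `lagSupplier`/`taskCountUpdater` callbacks. The callbacks stand in for reading lag stats and for "update taskCount in the DB and signal the supervisor". The one-task-per-cycle rule is an assumption, not the heuristic this PR implements. For brevity, lag is also read on each cycle rather than collected by a separate background task.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.IntConsumer;
import java.util.function.LongSupplier;

// Same shape as the SupervisorTaskAutoscaler interface proposed above (hypothetical).
interface SupervisorTaskAutoscaler
{
  void start();

  void stop();
}

// Hypothetical lag-based autoscaler: steps (1)-(4) live here, not in the supervisor.
class LagBasedAutoscaler implements SupervisorTaskAutoscaler
{
  private final LongSupplier lagSupplier;     // step (1): reports current total lag (hypothetical hook)
  private final IntConsumer taskCountUpdater; // steps (3)-(4): persist taskCount, signal supervisor (hypothetical hook)
  private final int minTaskCount;
  private final int maxTaskCount;
  private final long scaleOutLag;             // hypothetical: scale out when lag is above this
  private final long scaleInLag;              // hypothetical: scale in when lag is below this
  private final long periodSeconds;
  private int taskCount;
  private ScheduledExecutorService exec;

  LagBasedAutoscaler(
      LongSupplier lagSupplier,
      IntConsumer taskCountUpdater,
      int initialTaskCount,
      int minTaskCount,
      int maxTaskCount,
      long scaleOutLag,
      long scaleInLag,
      long periodSeconds
  )
  {
    this.lagSupplier = lagSupplier;
    this.taskCountUpdater = taskCountUpdater;
    this.taskCount = initialTaskCount;
    this.minTaskCount = minTaskCount;
    this.maxTaskCount = maxTaskCount;
    this.scaleOutLag = scaleOutLag;
    this.scaleInLag = scaleInLag;
    this.periodSeconds = periodSeconds;
  }

  // Step (2): the "autoscaling algorithm" as a pure function, so it can be swapped or tested alone.
  static int computeDesiredTaskCount(long lag, int current, int min, int max, long scaleOutLag, long scaleInLag)
  {
    if (lag > scaleOutLag) {
      return Math.min(current + 1, max);
    }
    if (lag < scaleInLag) {
      return Math.max(current - 1, min);
    }
    return current;
  }

  // One evaluation cycle; the scheduler calls this periodically.
  synchronized void runOnce()
  {
    int desired = computeDesiredTaskCount(
        lagSupplier.getAsLong(), taskCount, minTaskCount, maxTaskCount, scaleOutLag, scaleInLag);
    if (desired != taskCount) {
      taskCount = desired;
      taskCountUpdater.accept(desired);
    }
  }

  synchronized int getTaskCount()
  {
    return taskCount;
  }

  @Override
  public synchronized void start()
  {
    if (exec == null) {
      exec = Executors.newSingleThreadScheduledExecutor();
      exec.scheduleAtFixedRate(
          () -> {
            try {
              runOnce();
            }
            catch (RuntimeException e) {
              // an exception escaping the task would suppress all later runs; real code would log it
            }
          },
          periodSeconds,
          periodSeconds,
          TimeUnit.SECONDS
      );
    }
  }

  @Override
  public synchronized void stop()
  {
    if (exec != null) {
      exec.shutdownNow();
      exec = null;
    }
  }
}
```

The decision logic is isolated in `computeDesiredTaskCount`. Because of that, a different algorithm (for example one driven by pipeline load metrics) could be added as another `SupervisorTaskAutoscaler` implementation without touching the supervisor.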