himanshug commented on issue #8801: URL: https://github.com/apache/druid/issues/8801#issuecomment-664648399
The original proposal here, a `K8sTaskRunner` that creates one k8s pod (or a Job with replica count = 1) per peon, removes MMs from the equation entirely. "Autoscaling" then becomes the responsibility of the k8s scheduler itself. That is all great and beneficial for a good class of users.

However, some users have custom lookup implementations for large lookup data sizes, where lookup data is loaded at the MM level and shared by all the peons on that MM, which avoids a lot of resource waste. For users of that or other similar features that benefit from "sharing" resources across all peons running on the same MM, the `K8sTaskRunner` above is probably not the best autoscaling option, unless we could still allow the sharing by using EBS etc. even in that world. So, even after `K8sTaskRunner` exists, it would still be beneficial (and also easier to implement) for some users to have an autoscaling solution that keeps MMs.

There are multiple options for achieving that:

1. Use a combination of k8s HPA and custom metrics to meet the MM autoscaling requirements. From the previous comment, it sounds like HPA's mechanisms are not sophisticated enough at this time to support our requirements, so this becomes a good choice only at some point in the future when the k8s community improves things around HPA.
2. Let the [Druid Operator](https://github.com/druid-io/druid-operator) manage the autoscaling fully transparently. That is what is proposed in:

   > The operator on each reconcile, can hit this endpoint for each MM /druid/worker/v1/tasks and then scale down the particular deployment of MM ( meaning scale down rs to 0, not to delete the MM ). Scale up can be done using v1/indexer/pendingTask endpoint to increase the count of MM.
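Whichever component ends up making the decision (the operator's reconcile loop or an Overlord-side autoscaler), the sizing logic itself can be expressed independently of the k8s API. Below is a minimal, hypothetical sketch of that logic; the class/method names, the naive "one new MM per pending task" growth policy, and the idle-worker input are all my own illustrative assumptions, not existing Druid or operator code.

```java
// Hypothetical MM sizing logic. Inputs would come from the endpoints quoted
// above: pendingTasks from the pending-tasks endpoint, idleWorkers derived
// from polling each MM's /druid/worker/v1/tasks. Policy is illustrative only.
public class MiddleManagerScaleMath {
    /**
     * @param current      current number of MM replicas
     * @param pendingTasks number of tasks waiting for a worker
     * @param idleWorkers  MMs currently running zero tasks
     * @param min          lower bound on MM replicas
     * @param max          upper bound on MM replicas
     * @return desired MM replica count, clamped to [min, max]
     */
    public static int desiredWorkerCount(int current, int pendingTasks, int idleWorkers, int min, int max) {
        int desired;
        if (pendingTasks > 0) {
            desired = current + pendingTasks;   // grow while work is queued (naive: one MM per task)
        } else {
            desired = current - idleWorkers;    // shrink only the MMs that are truly idle
        }
        return Math.max(min, Math.min(max, desired));
    }
}
```

A real implementation would also have to avoid flapping (e.g. require several consecutive idle reconciles before scaling down), but the clamp-to-bounds shape would stay the same.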
Personally, I think Druid's Overlord process has the best access to all the state necessary for making autoscaling decisions: it already implements all of the above logic (and more) and delegates just the provisioning part to the `AutoScaler` interface. So, in my mind, it would be better to write a `K8sAutoScaler` to manage MM autoscaling.

From an implementation standpoint, one crucial problem is telling k8s to scale down by killing a *particular* MM pod, since k8s `Deployment`/`StatefulSet` resources can't be told which pod[s] to kill while scaling down. One workaround, of course, is to create one `Deployment` or `StatefulSet` resource for each MM individually. This workaround is good because it doesn't need anything non-standard. However, `Deployment`/`StatefulSet` resources with replica count 1 don't play well with things like `PodDisruptionBudget`.

I would also investigate whether we can use [CloneSet](https://github.com/openkruise/kruise/blob/master/docs/concepts/cloneset/README.md#selective-pod-deletion), which has the selective scale-down feature we need. At some point, I think, `Deployment` will support [that feature](https://github.com/kubernetes/kubernetes/issues/45509) and we can switch to a standard `Deployment` instead of `CloneSet` then. The obvious downside is that the k8s cluster then needs the controller for `CloneSet`, which isn't automatically available in a vanilla k8s cluster installation. With `CloneSet`, `druid-operator` could deploy the `CloneSet` resource and `K8sAutoScaler` would just update the replica counts to scale up/down. In that world, `druid-operator` would continue to be the source of truth for all of the deployment except the MM replica count.
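To make the CloneSet idea concrete: OpenKruise's CloneSet does selective pod deletion via `spec.scaleStrategy.podsToDelete`, so "kill these particular MM pods while lowering the replica count" reduces to one patch against the CloneSet object. Below is a hypothetical sketch of building that patch body; the class and method names are my own, and a real `K8sAutoScaler` would use a proper k8s client and JSON library rather than hand-built strings.

```java
import java.util.List;

// Hypothetical helper for a K8sAutoScaler: expresses a selective MM scale-down
// as a single merge-patch against an OpenKruise CloneSet, which honors
// spec.scaleStrategy.podsToDelete when reducing spec.replicas.
public class CloneSetScaleDown {
    public static String buildScaleDownPatch(int newReplicas, List<String> podsToDelete) {
        StringBuilder pods = new StringBuilder();
        for (int i = 0; i < podsToDelete.size(); i++) {
            if (i > 0) {
                pods.append(',');
            }
            pods.append('"').append(podsToDelete.get(i)).append('"');
        }
        // e.g. {"spec":{"replicas":2,"scaleStrategy":{"podsToDelete":["mm-3"]}}}
        return "{\"spec\":{\"replicas\":" + newReplicas
            + ",\"scaleStrategy\":{\"podsToDelete\":[" + pods + "]}}}";
    }
}
```

Since the patch only touches `replicas` and `scaleStrategy`, `druid-operator` can keep owning the rest of the CloneSet spec, which matches the "operator owns everything except the MM replica count" split described above.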
