himanshug commented on issue #8801:
URL: https://github.com/apache/druid/issues/8801#issuecomment-664648399


   The original proposal here, to have a K8sTaskRunner that creates one k8s pod (or a job with replica count = 1) per peon, removes MMs from the equation entirely. "Autoscaling" then essentially becomes the responsibility of the K8s scheduler itself. So, that is all great and beneficial for a good class of users.
   Some users have custom lookup implementations for large lookup data sizes, where lookup data is loaded at the MM level and shared by all the peons on that MM, which avoids a lot of resource wastage. For users of such features, or other similar features that somehow benefit from "sharing" resources across all peons running on the same MM, the K8sTaskRunner above would probably not be the best option for autoscaling, unless we could still allow the sharing (e.g. via EBS) even in that world.
   
   So, even after K8sTaskRunner, it would still be beneficial to some users (and also easier to implement) to have an autoscaling solution that keeps MMs. Now, there are multiple options for achieving that.
   
   1. Use a combination of k8s HPA and custom metrics to meet the MM autoscaling requirements. From the previous comment, it sounds like HPA's mechanisms are not sophisticated enough at this time to support our requirements. So, this one will be a good choice at some point in the future when the k8s community improves things around HPA.
   
   2. Let [Druid Operator](https://github.com/druid-io/druid-operator) manage the autoscaling totally transparently. That is what is proposed in:
   > The operator on each reconcile, can hit this endpoint for each MM 
/druid/worker/v1/tasks and then scale down the particular deployment of MM ( 
meaning scale down rs to 0, not to delete the MM ). Scale up can be done using 
v1/indexer/pendingTask endpoint to increase the count of MM.
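
   For option 1, a rough sketch of what the HPA approach might look like, assuming a hypothetical custom metric `druid_pending_tasks` exposed through a metrics adapter (the metric name, object names, and threshold here are all illustrative, not something Druid ships today):

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: middlemanager-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: druid-middlemanager   # illustrative name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Object
      object:
        metric:
          name: druid_pending_tasks   # hypothetical custom metric
        describedObject:
          apiVersion: v1
          kind: Service
          name: druid-overlord        # illustrative name
        target:
          type: Value
          value: "5"
```

   Note that even with such a metric wired up, HPA still cannot choose which particular pod to remove on scale-down, which is the crux of the problem discussed further down.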
   
   Personally, I think Druid's Overlord process has the best access to all the state necessary for making autoscaling decisions; it already implements all (and more) of the above logic and delegates just the provisioning part to the `AutoScaler` interface. So, in my mind, it would be better to write a `K8sAutoScaler` to manage MM autoscaling.
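
   To make the `K8sAutoScaler` idea concrete, here is a heavily simplified, self-contained sketch. The `AutoScaler` interface below is a stand-in with paraphrased method names, not Druid's actual `org.apache.druid.indexing.overlord.autoscaling.AutoScaler` interface, and the in-memory pod list stands in for a real k8s client adjusting replica counts:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Druid's AutoScaler extension point; the real
// interface has a richer signature (env config, ip/id lookups, etc.).
interface AutoScaler {
    int getMinNumWorkers();
    int getMaxNumWorkers();
    List<String> provision();                 // scale up: add MM pods
    List<String> terminate(List<String> ids); // scale down: kill specific MM pods
}

public class K8sAutoScalerSketch implements AutoScaler {
    private final int minWorkers;
    private final int maxWorkers;
    // In a real implementation this would be a k8s client updating the
    // replica count of the MM resource; here it is an in-memory stub.
    private final List<String> pods = new ArrayList<>();
    private int podCounter = 0;

    public K8sAutoScalerSketch(int minWorkers, int maxWorkers) {
        this.minWorkers = minWorkers;
        this.maxWorkers = maxWorkers;
        for (int i = 0; i < minWorkers; i++) {
            pods.add("mm-pod-" + (podCounter++));
        }
    }

    @Override public int getMinNumWorkers() { return minWorkers; }
    @Override public int getMaxNumWorkers() { return maxWorkers; }

    @Override
    public List<String> provision() {
        List<String> added = new ArrayList<>();
        if (pods.size() < maxWorkers) {
            // real version: increment replicas and let k8s create the pod
            String pod = "mm-pod-" + (podCounter++);
            pods.add(pod);
            added.add(pod);
        }
        return added;
    }

    @Override
    public List<String> terminate(List<String> ids) {
        List<String> removed = new ArrayList<>();
        for (String id : ids) {
            // Stay within the provisioning-strategy bounds, mirroring
            // what Overlord's existing logic enforces.
            if (pods.size() > minWorkers && pods.remove(id)) {
                // real version: name this pod for deletion and decrement replicas
                removed.add(id);
            }
        }
        return removed;
    }

    public List<String> currentPods() { return new ArrayList<>(pods); }

    public static void main(String[] args) {
        K8sAutoScalerSketch scaler = new K8sAutoScalerSketch(1, 3);
        System.out.println(scaler.provision());                    // [mm-pod-1]
        System.out.println(scaler.terminate(List.of("mm-pod-1"))); // [mm-pod-1]
        System.out.println(scaler.currentPods());                  // [mm-pod-0]
    }
}
```

   The point of the sketch is the shape of `terminate`: Overlord decides which specific MM to remove, and the implementation needs a way to make k8s honor that choice, which is exactly the problem discussed next.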
   
   From an implementation standpoint, one crucial problem is "telling k8s to scale down by killing a particular MM pod", since k8s `Deployment`/`StatefulSet` resources can't be told which pod(s) to kill while scaling down. One workaround, of course, is to create one `Deployment` or `StatefulSet` resource for each MM individually. This workaround is nice because it doesn't need anything non-standard. However, a `Deployment` or `StatefulSet` with replica count 1 doesn't play well with things like `PodDisruptionBudget`.
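
   For concreteness, the one-resource-per-MM workaround would look roughly like the following, with one such resource per MM (all names illustrative); the `PodDisruptionBudget` caveat above applies to each of them:

```yaml
# One Deployment per MiddleManager; scaling down "mm-2" specifically
# means zeroing or deleting this one resource.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: druid-mm-2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: druid-mm
      mm-id: "2"
  template:
    metadata:
      labels:
        app: druid-mm
        mm-id: "2"
    spec:
      containers:
        - name: middlemanager
          image: apache/druid   # illustrative; pin a real version
```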
   
   I would also investigate whether we can use [CloneSet](https://github.com/openkruise/kruise/blob/master/docs/concepts/cloneset/README.md#selective-pod-deletion), which has the selective scale-down feature that we need. At some point, I think, `Deployment` will support [that feature](https://github.com/kubernetes/kubernetes/issues/45509), and we can switch to a standard `Deployment` instead of `CloneSet` then. The obvious downside is that the k8s cluster then needs the controller for `CloneSet`, which isn't automatically available in a vanilla k8s installation. With `CloneSet`, we can have `druid-operator` deploy the `CloneSet` resource while `K8sAutoScaler` just updates the replica counts to scale up/down. In that world, `druid-operator` would continue to be the source of truth for everything about the deployment except the MM replica count.
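
   Per the CloneSet docs linked above, selective scale-down works by naming the pods to remove in `spec.scaleStrategy.podsToDelete` while decrementing `replicas`, which is exactly what `K8sAutoScaler` would patch. A rough sketch (names illustrative):

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: druid-middlemanager
spec:
  replicas: 3                # K8sAutoScaler decrements this on scale-down...
  scaleStrategy:
    podsToDelete:
      - druid-middlemanager-x7k2p   # ...and names the drained MM pod here
  selector:
    matchLabels:
      app: druid-mm
  template:
    metadata:
      labels:
        app: druid-mm
    spec:
      containers:
        - name: middlemanager
          image: apache/druid   # illustrative; pin a real version
```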

