himanshug commented on issue #8801: URL: https://github.com/apache/druid/issues/8801#issuecomment-664648399
The original proposal here, a `K8sTaskRunner` that creates one k8s pod (or a Job with replica count = 1) per peon, removes MMs from the equation entirely. "Autoscaling" then becomes the responsibility of the k8s scheduler itself. That is all great and beneficial for a good class of users.

However, some users have custom lookup implementations for large lookup data sizes, where lookup data is loaded at the MM level and shared by all the peons on that MM, which avoids a lot of resource waste. For users of that or other similar features that benefit from "sharing" resources across all peons running on the same MM, the `K8sTaskRunner` above is probably not the best autoscaling option, unless we could still allow the sharing by using EBS etc. even in that world. So, even after `K8sTaskRunner` exists, it would still be beneficial (and also easier to implement) for some users to have an autoscaling solution that keeps MMs.

There are multiple options for achieving that:

1. Use a combination of k8s HPA and custom metrics to meet the MM autoscaling requirements. From the previous comment, it sounds like HPA's mechanisms are not sophisticated enough at this time to support our requirements, so this becomes a good choice only at some point in the future when the k8s community improves things around HPA.
2. Let the [Druid Operator](https://github.com/druid-io/druid-operator) manage the autoscaling fully transparently. That is what is proposed in:

   > The operator on each reconcile, can hit this endpoint for each MM /druid/worker/v1/tasks and then scale down the particular deployment of MM ( meaning scale down rs to 0, not to delete the MM ). Scale up can be done using v1/indexer/pendingTask endpoint to increase the count of MM.
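Whichever component ends up making the decision (the operator's reconcile loop or an Overlord-side autoscaler), the sizing logic itself can be expressed independently of the k8s API. Below is a minimal, hypothetical sketch of that logic; the class/method names, the naive "one new MM per pending task" growth policy, and the idle-worker input are all my own illustrative assumptions, not existing Druid or operator code.

```java
// Hypothetical MM sizing logic. Inputs would come from the endpoints quoted
// above: pendingTasks from the pending-tasks endpoint, idleWorkers derived
// from polling each MM's /druid/worker/v1/tasks. Policy is illustrative only.
public class MiddleManagerScaleMath {
    /**
     * @param current      current number of MM replicas
     * @param pendingTasks number of tasks waiting for a worker
     * @param idleWorkers  MMs currently running zero tasks
     * @param min          lower bound on MM replicas
     * @param max          upper bound on MM replicas
     * @return desired MM replica count, clamped to [min, max]
     */
    public static int desiredWorkerCount(int current, int pendingTasks, int idleWorkers, int min, int max) {
        int desired;
        if (pendingTasks > 0) {
            desired = current + pendingTasks;   // grow while work is queued (naive: one MM per task)
        } else {
            desired = current - idleWorkers;    // shrink only the MMs that are truly idle
        }
        return Math.max(min, Math.min(max, desired));
    }
}
```

A real implementation would also have to avoid flapping (e.g. require several consecutive idle reconciles before scaling down), but the clamp-to-bounds shape would stay the same.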
Personally, I think Druid's Overlord process has the best access to all the state necessary for making autoscaling decisions: it already implements all of the above logic (and more) and delegates just the provisioning part to the `AutoScaler` interface. So, in my mind, it would be better to write a `K8sAutoScaler` to manage MM autoscaling.

From an implementation standpoint, one crucial problem is telling k8s to scale down by killing a *particular* MM pod, since k8s `Deployment`/`StatefulSet` resources can't be told which pod[s] to kill while scaling down. One workaround, of course, is to create one `Deployment` or `StatefulSet` resource for each MM individually. This workaround is good because it doesn't need anything non-standard. However, `Deployment`/`StatefulSet` resources with replica count 1 don't play well with things like `PodDisruptionBudget`.

I would also investigate whether we can use [CloneSet](https://github.com/openkruise/kruise/blob/master/docs/concepts/cloneset/README.md#selective-pod-deletion), which has the selective scale-down feature we need. At some point, I think, `Deployment` will support [that feature](https://github.com/kubernetes/kubernetes/issues/45509) and we can switch to a standard `Deployment` instead of `CloneSet` then. The obvious downside is that the k8s cluster then needs the controller for `CloneSet`, which isn't automatically available in a vanilla k8s cluster installation. With `CloneSet`, `druid-operator` could deploy the `CloneSet` resource and `K8sAutoScaler` would just update the replica counts to scale up/down. In that world, `druid-operator` would continue to be the source of truth for all of the deployment except the MM replica count.
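To make the CloneSet idea concrete: OpenKruise's CloneSet does selective pod deletion via `spec.scaleStrategy.podsToDelete`, so "kill these particular MM pods while lowering the replica count" reduces to one patch against the CloneSet object. Below is a hypothetical sketch of building that patch body; the class and method names are my own, and a real `K8sAutoScaler` would use a proper k8s client and JSON library rather than hand-built strings.

```java
import java.util.List;

// Hypothetical helper for a K8sAutoScaler: expresses a selective MM scale-down
// as a single merge-patch against an OpenKruise CloneSet, which honors
// spec.scaleStrategy.podsToDelete when reducing spec.replicas.
public class CloneSetScaleDown {
    public static String buildScaleDownPatch(int newReplicas, List<String> podsToDelete) {
        StringBuilder pods = new StringBuilder();
        for (int i = 0; i < podsToDelete.size(); i++) {
            if (i > 0) {
                pods.append(',');
            }
            pods.append('"').append(podsToDelete.get(i)).append('"');
        }
        // e.g. {"spec":{"replicas":2,"scaleStrategy":{"podsToDelete":["mm-3"]}}}
        return "{\"spec\":{\"replicas\":" + newReplicas
            + ",\"scaleStrategy\":{\"podsToDelete\":[" + pods + "]}}}";
    }
}
```

Since the patch only touches `replicas` and `scaleStrategy`, `druid-operator` can keep owning the rest of the CloneSet spec, which matches the "operator owns everything except the MM replica count" split described above.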
