zhangyue19921010 opened a new issue #10824:
URL: https://github.com/apache/druid/issues/10824


   # Motivation
   Druid uses the MiddleManager service to launch Peons for data ingestion. Users 
can set `druid.indexer.runner.javaOpts` in the MiddleManager's runtime.properties 
to control the JVM configuration of Peons, such as memory size. The Overlord 
schedules Peons onto the appropriate MiddleManager node based on available task 
slots.
   
   The current resource scheduling model described above has a few limitations:
   
   1. The resource utilization of a MiddleManager node is uncontrollable: the 
MiddleManager must reserve a large amount of memory in advance to provide 
sufficient resources for the Peons it might launch.
   2. Different types of tasks must share the same resource settings, which 
causes waste. For example, a low-workload batch task has to use the same 
resources as a Kafka ingestion task. Users can set 
`druid.indexer.runner.javaOpts` in the task context to modify the JVM 
parameters of a specific Peon, but because the current scheduling model is 
slot-based, they can effectively only specify a smaller memory size: setting a 
larger size in the task context would over-allocate the MiddleManager's memory 
and cause OOM, while setting a smaller size is meaningless because the 
resources are pre-allocated anyway.
   3. Peons need CPU resources to perform computations and respond to queries, 
and different types of tasks have different CPU requirements. The current 
resource scheduling model does not limit CPU resources at all. This can waste 
CPU when multiple low-CPU tasks run on the same MiddleManager node, or cause 
excessive CPU usage and longer query times when multiple high-CPU tasks run on 
the same node. Therefore it is also necessary to limit CPU resources.
   
   # Proposed changes
   A new extension-contrib `druid-kubernetes-middlemanager-extensions` would be 
added, containing an implementation of `BasedRestorableTaskRunner` named 
`K8sForkingTaskRunner`, a new module named K8sMiddleManagerModule, and so on.
   Additionally, since this is the first such extension, some changes may also 
be needed in core to enable writing it.
   
   We will also add some new properties to the MiddleManager's runtime.properties:
   
   Property | Description | Default
   -- | -- | --
   druid.indexer.runner.mode | The running mode of MiddleManager-Peon. If set 
to `k8s`, the MiddleManager will create and own Peon pods that perform 
ingestion on K8s. | native
   druid.indexer.namespace | The namespace of the Druid cluster on K8s. | default
   druid.indexer.image | The Druid base image. | druid/cluster:v1
   druid.indexer.default.pod.memory | The default memory limit of Peon pods 
created by the MiddleManager. | 2G
   druid.indexer.default.pod.cpu | The default CPU limit of Peon pods created 
by the MiddleManager. | 1
   
   Add some new properties to the task context:
   
   Property | Description | Default
   -- | -- | --
   druid.peon.javaOpts | The JVM configuration of a specific Peon pod. | JVM 
configs in the MiddleManager's runtime.properties
   druid.peon.pod.memory | The memory limit of a specific Peon pod created by 
the MiddleManager. | `druid.indexer.default.pod.memory` in the MiddleManager's 
runtime.properties
   druid.peon.pod.cpu | The CPU limit of a specific Peon pod created by the 
MiddleManager. | `druid.indexer.default.pod.cpu` in the MiddleManager's 
runtime.properties
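
   For example, a task that needs less than the defaults could carry a context 
like the following (values are illustrative):

   ```json
   {
     "context": {
       "druid.peon.javaOpts": "-Xmx1g -Xms1g",
       "druid.peon.pod.memory": "1500M",
       "druid.peon.pod.cpu": "0.5"
     }
   }
   ```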
   
   As shown above, the priority of these properties is `Task Context > 
runtime.properties > coded default values`.
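
   A minimal sketch of that resolution order, with hypothetical names (this is 
not actual Druid code, just an illustration of the fallback chain):

   ```java
   import java.util.Map;

   // Illustrative only: resolve a Peon pod's memory limit using the priority
   // described above: task context > runtime.properties > coded default.
   public class PeonResourceResolver {
       // Coded default from the runtime.properties table above
       static final String DEFAULT_POD_MEMORY = "2G";

       // taskContext stands in for the task's context map; runtimeProperties
       // stands in for the MiddleManager's runtime.properties.
       static String resolvePodMemory(Map<String, String> taskContext,
                                      Map<String, String> runtimeProperties) {
           String fromContext = taskContext.get("druid.peon.pod.memory");
           if (fromContext != null) {
               return fromContext;
           }
           String fromRuntime = runtimeProperties.get("druid.indexer.default.pod.memory");
           return fromRuntime != null ? fromRuntime : DEFAULT_POD_MEMORY;
       }
   }
   ```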
   
   `druid-kubernetes-middlemanager-extensions` needs to be added to 
`druid.extensions.loadList` in the MiddleManager's runtime.properties only.
   
   # Rationale
   Based on ForkingTaskRunner, we introduce a new runner named K8sForkingTaskRunner.
   
   Instead of using `ProcessBuilder.start()` to create a new child process as 
ForkingTaskRunner does, we use kubernetes-java-client to create Peon pods and 
run tasks in them, and also handle stop, tracing, log collection, and garbage 
collection through K8s.
   
   1. Use a ConfigMap to pass `task.json` from the MiddleManager to the Peon 
pod. There is a conflict between the local directory layout and the ConfigMap 
mountPath: a mountPath is not allowed to contain ":", so the hand-off has to be 
done carefully.
   2. Use ownerReference for garbage collection, so that when a Peon finishes, 
everything related to it, such as its ConfigMap, is deleted automatically.
   3. The MiddleManager and the Peon pod need to communicate for log collection 
and lifecycle control.
   4. Use kubernetes-java-client to perform `create pod`, `wait for pod 
running`, `wait for pod finished`, and so on.
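
   The objects described in points 1 and 2 could look roughly like this. All 
names, labels, and paths here are hypothetical; the ConfigMap's ownerReference 
points at the Peon pod so K8s garbage-collects it when the pod is deleted:

   ```yaml
   # Illustrative only: task.json delivered via ConfigMap, owned by the Peon pod
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: druid-peon-task-json
     ownerReferences:               # GC: deleted automatically with the Peon pod
       - apiVersion: v1
         kind: Pod
         name: druid-peon-task
         uid: <peon-pod-uid>
   data:
     task.json: |
       { "type": "index_parallel", "...": "..." }
   ---
   # Illustrative only: a Peon pod as K8sForkingTaskRunner might create it
   apiVersion: v1
   kind: Pod
   metadata:
     name: druid-peon-task
     namespace: default             # druid.indexer.namespace
   spec:
     restartPolicy: Never
     containers:
       - name: peon
         image: druid/cluster:v1    # druid.indexer.image
         resources:
           limits:
             memory: 2G             # druid.indexer.default.pod.memory
             cpu: 1                 # druid.indexer.default.pod.cpu
         volumeMounts:
           - name: task-json
             mountPath: /opt/druid/task   # ":" is not allowed in mountPath
     volumes:
       - name: task-json
         configMap:
           name: druid-peon-task-json
   ```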
   
   <img width="852" alt="Screenshot 2021-02-01 14:41:04" 
src="https://user-images.githubusercontent.com/69956021/106423501-88a2a100-649b-11eb-8da5-e962d49b7d06.png">
   
   <img width="625" alt="Screenshot 2021-02-01 14:35:48" 
src="https://user-images.githubusercontent.com/69956021/106423538-948e6300-649b-11eb-9a6f-1169e9b87092.png">
   
   # Advantages
   Cost savings and improved resource utilization.
   
   We use Peon pods only for data ingestion and let the K8s cluster do the 
resource scheduling work that K8s is good at. When a Druid cluster enables MOK, 
users can set different CPU/memory resources for different tasks, and K8s will 
schedule and run the Peon pods with high resource utilization.
   
   Also, if we combine pods with something like 
[AWS Fargate](https://aws.amazon.com/fargate/), resource usage and cost can be 
improved further: the MiddleManager can temporarily request the appropriate 
resources (you only pay for the resources actually requested), run the Peon 
pod, and release the resources after the task finishes.
   
   In short, there is no need to let the MiddleManager occupy a lot of 
resources in advance; it can request resources only when it needs them.
   
   # Operational impact
   None
   
   # Test plan (optional)
   I will test the extension on dev Druid clusters deployed in K8s, covering 
both data ingestion and data query.
   
   
   

