zhangyue19921010 opened a new issue #10824:
URL: https://github.com/apache/druid/issues/10824


   # Motivation
   Druid uses the MiddleManager service to launch Peons for data ingestion. Users 
can set `druid.indexer.runner.javaOpts` in the MiddleManager's runtime.properties 
to control the JVM configuration of Peons, such as memory size. The Overlord 
schedules Peons onto the appropriate MiddleManager node based on available task 
slots.
   
   The current resource scheduling model described above has a few limitations:
   
   1. The resource utilization of a MiddleManager node is uncontrollable: the 
MiddleManager must reserve a large amount of memory in advance to provide 
sufficient resources for the Peons it might launch.
   2. Different types of tasks must share the same resource settings, which 
causes waste. For example, a low-workload batch task has to use the same 
resources as a Kafka ingestion task. Users can set 
`druid.indexer.runner.javaOpts` in the task context to modify the JVM 
parameters of a specific Peon, but because the current scheduling model is 
slot-based, they can effectively only specify a smaller memory size: setting a 
larger size in the task context would over-allocate the MiddleManager's memory 
and cause OOM, while setting a smaller size is meaningless because the 
resources are pre-allocated anyway.
   3. Peons need CPU resources to perform computations and respond to queries, 
and different types of tasks have different CPU requirements. The current 
resource scheduling model does not limit CPU resources at all. This can waste 
CPU when multiple low-CPU tasks run on the same MiddleManager node, or cause 
excessive CPU usage and longer query times when multiple high-CPU tasks run on 
the same node. Therefore it is also necessary to limit CPU resources.
   
   # Proposed changes
   A new extension-contrib `druid-kubernetes-middlemanager-extensions` would be 
added, containing an implementation of `BasedRestorableTaskRunner` named 
`K8sForkingTaskRunner`, a new module named K8sMiddleManagerModule, and so on.
   Additionally, since this is the first such extension, some changes may also 
be needed in core to enable writing it.
   
   We will also add some new properties to the MiddleManager's runtime.properties:
   
   Property | Description | Default
   -- | -- | --
   druid.indexer.runner.mode | The running mode of MiddleManager-Peon. If set 
to `k8s`, the MiddleManager will create and own Peon pods that perform 
ingestion on K8s. | native
   druid.indexer.namespace | The namespace of the Druid cluster on K8s. | default
   druid.indexer.image | The Druid base image. | druid/cluster:v1
   druid.indexer.default.pod.memory | The default memory limit of Peon pods 
created by the MiddleManager. | 2G
   druid.indexer.default.pod.cpu | The default CPU limit of Peon pods created 
by the MiddleManager. | 1
   
   Add some new properties to the task context:
   
   Property | Description | Default
   -- | -- | --
   druid.peon.javaOpts | The JVM configuration of a specific Peon pod. | JVM 
configs in the MiddleManager's runtime.properties
   druid.peon.pod.memory | The memory limit of a specific Peon pod created by 
the MiddleManager. | `druid.indexer.default.pod.memory` in the MiddleManager's 
runtime.properties
   druid.peon.pod.cpu | The CPU limit of a specific Peon pod created by the 
MiddleManager. | `druid.indexer.default.pod.cpu` in the MiddleManager's 
runtime.properties
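
   For example, a task that needs less than the defaults could carry a context 
like the following (values are illustrative):

   ```json
   {
     "context": {
       "druid.peon.javaOpts": "-Xmx1g -Xms1g",
       "druid.peon.pod.memory": "1500M",
       "druid.peon.pod.cpu": "0.5"
     }
   }
   ```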
   
   As shown above, the priority of these properties is `Task Context > 
runtime.properties > coded default values`.
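
   A minimal sketch of that resolution order, with hypothetical names (this is 
not actual Druid code, just an illustration of the fallback chain):

   ```java
   import java.util.Map;

   // Illustrative only: resolve a Peon pod's memory limit using the priority
   // described above: task context > runtime.properties > coded default.
   public class PeonResourceResolver {
       // Coded default from the runtime.properties table above
       static final String DEFAULT_POD_MEMORY = "2G";

       // taskContext stands in for the task's context map; runtimeProperties
       // stands in for the MiddleManager's runtime.properties.
       static String resolvePodMemory(Map<String, String> taskContext,
                                      Map<String, String> runtimeProperties) {
           String fromContext = taskContext.get("druid.peon.pod.memory");
           if (fromContext != null) {
               return fromContext;
           }
           String fromRuntime = runtimeProperties.get("druid.indexer.default.pod.memory");
           return fromRuntime != null ? fromRuntime : DEFAULT_POD_MEMORY;
       }
   }
   ```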
   
   `druid-kubernetes-middlemanager-extensions` needs to be added to 
`druid.extensions.loadList` in the MiddleManager's runtime.properties only.
   
   # Rationale
   Based on ForkingTaskRunner, we introduce a new runner named K8sForkingTaskRunner.
   
   Instead of using `ProcessBuilder.start()` to create a new child process as 
ForkingTaskRunner does, we use kubernetes-java-client to create Peon pods and 
run tasks in them, and also handle stop, tracing, log collection, and garbage 
collection through K8s.
   
   1. Use a ConfigMap to pass `task.json` from the MiddleManager to the Peon 
pod. There is a conflict between the local directory layout and the ConfigMap 
mountPath: a mountPath is not allowed to contain ":", so the hand-off has to be 
done carefully.
   2. Use ownerReference for garbage collection, so that when a Peon finishes, 
everything related to it, such as its ConfigMap, is deleted automatically.
   3. The MiddleManager and the Peon pod need to communicate for log collection 
and lifecycle control.
   4. Use kubernetes-java-client to perform `create pod`, `wait for pod 
running`, `wait for pod finished`, and so on.
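
   The objects described in points 1 and 2 could look roughly like this. All 
names, labels, and paths here are hypothetical; the ConfigMap's ownerReference 
points at the Peon pod so K8s garbage-collects it when the pod is deleted:

   ```yaml
   # Illustrative only: task.json delivered via ConfigMap, owned by the Peon pod
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: druid-peon-task-json
     ownerReferences:               # GC: deleted automatically with the Peon pod
       - apiVersion: v1
         kind: Pod
         name: druid-peon-task
         uid: <peon-pod-uid>
   data:
     task.json: |
       { "type": "index_parallel", "...": "..." }
   ---
   # Illustrative only: a Peon pod as K8sForkingTaskRunner might create it
   apiVersion: v1
   kind: Pod
   metadata:
     name: druid-peon-task
     namespace: default             # druid.indexer.namespace
   spec:
     restartPolicy: Never
     containers:
       - name: peon
         image: druid/cluster:v1    # druid.indexer.image
         resources:
           limits:
             memory: 2G             # druid.indexer.default.pod.memory
             cpu: 1                 # druid.indexer.default.pod.cpu
         volumeMounts:
           - name: task-json
             mountPath: /opt/druid/task   # ":" is not allowed in mountPath
     volumes:
       - name: task-json
         configMap:
           name: druid-peon-task-json
   ```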
   
   <img width="852" alt="Screenshot 2021-02-01 14:41:04" 
src="https://user-images.githubusercontent.com/69956021/106423501-88a2a100-649b-11eb-8da5-e962d49b7d06.png">
   
   <img width="625" alt="Screenshot 2021-02-01 14:35:48" 
src="https://user-images.githubusercontent.com/69956021/106423538-948e6300-649b-11eb-9a6f-1169e9b87092.png">
   
   # Advantages
   Cost savings and improved resource utilization.
   
   We use Peon pods only for data ingestion and let the K8s cluster do the 
resource scheduling work that K8s is good at. When a Druid cluster enables MOK, 
users can set different CPU/memory resources for different tasks, and K8s will 
schedule and run the Peon pods with high resource utilization.
   
   Also, if we combine pods with something like 
[AWS Fargate](https://aws.amazon.com/fargate/), resource usage and cost can be 
improved further: the MiddleManager can temporarily request the appropriate 
resources (you only pay for the resources actually requested), run the Peon 
pod, and release the resources after the task finishes.
   
   In short, there is no need to let the MiddleManager occupy a lot of 
resources in advance; it can request resources only when it needs them.
   
   # Operational impact
   None
   
   # Test plan (optional)
   I will test the extension on dev Druid clusters deployed in K8s, covering 
both data ingestion and data query.
   
   
   

