zhangyue19921010 opened a new issue #10824: URL: https://github.com/apache/druid/issues/10824
# Motivation

Druid uses the MiddleManager service to launch Peons for data ingestion. Users can set `druid.indexer.runner.javaOpts` in the MiddleManager runtime.properties to control the Peon JVM configuration, such as memory size. The Overlord schedules a Peon onto an appropriate MiddleManager node based on task slots. This slot-based resource scheduling model has a few limitations:

1. The resource utilization of a MiddleManager node is uncontrollable, and the MiddleManager must reserve a large amount of memory in advance to provide sufficient resources for the Peons it may spawn.
2. Different types of tasks must use the same resource settings, which causes waste. For example, a low-workload batch task gets the same unified resources as a Kafka ingestion task. Users can set `druid.indexer.runner.javaOpts` in the task context to modify the JVM parameters of a specific Peon, but because the current scheduling model is slot-based, they can only safely specify a smaller memory size: a larger value in the task context would over-allocate the MiddleManager's memory and cause an OOM. On the other hand, because resources are pre-allocated, setting a smaller memory size for a specific Peon saves nothing.
3. Peons need CPU resources for computation and for answering queries, and different task types have different CPU requirements. The current resource scheduling model does not limit CPU at all. This can waste CPU when multiple low-CPU tasks run on the same MiddleManager node, or lengthen query times through excessive CPU usage when multiple high-CPU tasks run on the same node. Limiting CPU resources is therefore also necessary.
# Proposed changes

A new extension-contrib, `druid-kubernetes-middlemanager-extensions`, would be added, containing an implementation of `BaseRestorableTaskRunner` named `K8sForkingTaskRunner`, a new module named `K8sMiddleManagerModule`, and so on. Since this is the first such extension, some changes may also be needed in core to enable writing it.

The following properties would be added to the MiddleManager runtime.properties:

Property | Description | Default
-- | -- | --
`druid.indexer.runner.mode` | The running mode of MiddleManager-Peon. If set to `k8s`, the MiddleManager will create and own Peon pods that perform ingestion on K8s. | native
`druid.indexer.namespace` | The namespace of the Druid cluster on K8s. | default
`druid.indexer.image` | The Druid base image. | druid/cluster:v1
`druid.indexer.default.pod.memory` | The default memory limit of a Peon pod created by the MiddleManager. | 2G
`druid.indexer.default.pod.cpu` | The default CPU limit of a Peon pod created by the MiddleManager. | 1

The following properties would be added to the task context:

Property | Description | Default
-- | -- | --
`druid.peon.javaOpts` | The JVM configuration of a specific Peon pod. | JVM configs in MiddleManager runtime.properties
`druid.peon.pod.memory` | The memory limit of a specific Peon pod created by the MiddleManager. | `druid.indexer.default.pod.memory` in MiddleManager runtime.properties
`druid.peon.pod.cpu` | The CPU limit of a specific Peon pod created by the MiddleManager. | `druid.indexer.default.pod.cpu` in MiddleManager runtime.properties

As shown above, the property precedence is `Task Context > runtime.properties > coded default values`. `druid-kubernetes-middlemanager-extensions` needs to be added to `druid.extensions.loadList` only in the MiddleManager runtime.properties.

# Rationale

Based on `ForkingTaskRunner`, we add a new runner named `K8sForkingTaskRunner`. Instead of using `ProcessBuilder.start()` to spawn a local child process as `ForkingTaskRunner` does, the new runner delegates process management to Kubernetes.
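The precedence rule `Task Context > runtime.properties > coded default values` can be sketched as follows; the class and method names here are hypothetical illustrations, not existing Druid code:

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch of the proposed resolution order for Peon pod
// resource settings: task context first, then runtime.properties,
// then the coded default.
class PeonResourceResolver {
    static String resolve(String key,
                          Map<String, Object> taskContext,
                          Properties runtimeProps,
                          String codedDefault) {
        Object fromContext = taskContext.get(key);           // highest priority
        if (fromContext != null) {
            return fromContext.toString();
        }
        String fromRuntime = runtimeProps.getProperty(key);  // middle priority
        if (fromRuntime != null) {
            return fromRuntime;
        }
        return codedDefault;                                 // lowest priority
    }
}
```

For example, a task that sets `druid.peon.pod.memory` in its context would get that value even if the MiddleManager's runtime.properties configures a different default.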
We use the kubernetes-java-client to create and run tasks in Peon pods, and also perform stop, trace, log collection, and garbage collection through K8s.

1. Use a ConfigMap to pass `task.json` from the MiddleManager to the Peon pod. There is a conflict between the local task directory and the ConfigMap mountPath: a mountPath may not contain ":", so the hand-off has to be done carefully.
2. Use an ownerReference for garbage collection, so that when the Peon is done, everything related to it, such as the ConfigMap, is deleted automatically.
3. Communicate between the MiddleManager and the Peon pod for log collection and lifecycle control.
4. Use the kubernetes-java-client to `create pod`, `wait for pod running`, `wait for pod finished`, and so on.

<img width="852" alt="Screenshot 2021-02-01 2:41 PM" src="https://user-images.githubusercontent.com/69956021/106423501-88a2a100-649b-11eb-8da5-e962d49b7d06.png"> <img width="625" alt="Screenshot 2021-02-01 2:35 PM" src="https://user-images.githubusercontent.com/69956021/106423538-948e6300-649b-11eb-9a6f-1169e9b87092.png">

# Advantage

Cost saving and improved resource utilization. The Peon pod is used only for data ingestion, and the K8s cluster does the resource scheduling work that it is good at. When a Druid cluster enables MOK, users can set different CPU/memory resources for different tasks, and K8s will schedule and run the Peon pods with high resource utilization. If we also combine pods with something like [AWS Fargate](https://aws.amazon.com/fargate/), resource usage and cost can improve further: the MiddleManager can temporarily acquire appropriate resources (you pay only for the resources acquired), run the Peon pod, and release those resources after the task finishes. In short, the MiddleManager no longer needs to reserve a lot of resources in advance; it requests resources only when it is about to use them.

# Operational impact

None

# Test plan (optional)

I would test the extension on dev Druid clusters deployed in K8s, covering both data ingestion and data query.
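To make steps 1 and 2 of the Rationale concrete, a Peon pod spec with a ConfigMap-mounted `task.json` and an ownerReference for garbage collection might look roughly like the following. All names, paths, and resource values here are illustrative assumptions, not part of the proposal:

```yaml
# Illustrative ConfigMap carrying task.json. Its ownerReference points at the
# Peon pod (the uid is patched in after the pod is created), so Kubernetes
# garbage-collects the ConfigMap automatically when the pod is deleted.
apiVersion: v1
kind: ConfigMap
metadata:
  name: peon-example-task-config   # hypothetical name
  namespace: default               # druid.indexer.namespace
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: peon-example-task      # the owning Peon pod
      uid: ""                      # set to the Peon pod's UID at runtime
data:
  task.json: "{}"                  # the serialized task spec
---
# Illustrative Peon pod created by the MiddleManager.
apiVersion: v1
kind: Pod
metadata:
  name: peon-example-task
  namespace: default
spec:
  restartPolicy: Never
  containers:
    - name: peon
      image: druid/cluster:v1          # druid.indexer.image
      resources:
        limits:
          memory: 2Gi                  # druid.indexer.default.pod.memory
          cpu: "1"                     # druid.indexer.default.pod.cpu
      volumeMounts:
        - name: task-json
          mountPath: /opt/druid/task   # a colon-free path, since ":" is not
                                       # allowed in a mountPath
  volumes:
    - name: task-json
      configMap:
        name: peon-example-task-config
```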
