This is an automated email from the ASF dual-hosted git repository.
suneet pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new 174053f4fd Add readme for kubernetes-overlord-extensions and update
docs (#14674)
174053f4fd is described below
commit 174053f4fd250534da1f1aa4e7e3ff6c188e3c1a
Author: George Shiqi Wu <[email protected]>
AuthorDate: Tue Aug 1 16:29:44 2023 -0400
Add readme for kubernetes-overlord-extensions and update docs (#14674)
* Add readme for kubernetes task scheduler
* clean up uneeded stuff
* Update extensions-contrib/kubernetes-overlord-extensions/README.md
Co-authored-by: Abhishek Agarwal
<[email protected]>
* Move documentation into main page
* indentation
* cleanup spellcheck errors
* Update docs/development/extensions-contrib/k8s-jobs.md
Co-authored-by: Suneet Saldanha <[email protected]>
* Update extensions-contrib/kubernetes-overlord-extensions/README.md
Co-authored-by: Suneet Saldanha <[email protected]>
* Update docs/development/extensions-contrib/k8s-jobs.md
Co-authored-by: Suneet Saldanha <[email protected]>
* PR comments
* Update docs/development/extensions-contrib/k8s-jobs.md
Co-authored-by: Suneet Saldanha <[email protected]>
* Update docs/development/extensions-contrib/k8s-jobs.md
Co-authored-by: Suneet Saldanha <[email protected]>
* Update docs/development/extensions-contrib/k8s-jobs.md
Co-authored-by: Suneet Saldanha <[email protected]>
---------
Co-authored-by: Abhishek Agarwal
<[email protected]>
Co-authored-by: Suneet Saldanha <[email protected]>
---
docs/development/extensions-contrib/k8s-jobs.md | 268 +++++++++++++++------
.../kubernetes-overlord-extensions/README.md | 32 +++
2 files changed, 227 insertions(+), 73 deletions(-)
diff --git a/docs/development/extensions-contrib/k8s-jobs.md
b/docs/development/extensions-contrib/k8s-jobs.md
index 7264694b6c..cd925c2ee0 100644
--- a/docs/development/extensions-contrib/k8s-jobs.md
+++ b/docs/development/extensions-contrib/k8s-jobs.md
@@ -28,73 +28,33 @@ Consider this an [EXPERIMENTAL](../experimental.md) feature
mostly because it ha
## How it works
-The K8s extension builds a pod spec using the specified pod adapter, the
default implementation takes the podSpec of your `Overlord` pod and creates a
kubernetes job from this podSpec. Thus if you have sidecars such as Splunk or
Istio it can optionally launch a task as a K8s job. All jobs are natively
restorable, they are decoupled from the druid deployment, thus restarting pods
or doing upgrades has no affect on tasks in flight. They will continue to run
and when the overlord comes b [...]
+The K8s extension builds a pod spec for each task using the specified pod
adapter. All jobs are natively restorable, they are decoupled from the Druid
deployment, thus restarting pods or doing upgrades has no affect on tasks in
flight. They will continue to run and when the overlord comes back up it will
start tracking them again.
-## Pod Adapters
-The logic defining how the pod template is built for your kubernetes job
depends on which pod adapter you have specified.
-
-### Overlord Single Container Pod Adapter
-The overlord single container pod adapter takes the podSpec of your `Overlord`
pod and creates a kubernetes job from this podSpec. This is the default pod
adapter implementation, to explicitly enable it you can specify the runtime
property `druid.indexer.runner.k8s.adapter.type: overlordSingleContainer`
-
-### Overlord Multi Container Pod Adapter
-The overlord multi container pod adapter takes the podSpec of your `Overlord`
pod and creates a kubernetes job from this podSpec. It uses kubexit to manage
dependency ordering between the main container that runs your druid peon and
other sidecars defined in the `Overlord` pod spec. To enable this pod adapter
you can specify the runtime property `druid.indexer.runner.k8s.adapter.type:
overlordMultiContainer`
-
-### Custom Template Pod Adapter
-The custom template pod adapter allows you to specify a pod template file per
task type. This adapter requires you to specify a `base` pod spec which will
be used in the case that a task specific pod spec has not been defined. To
enable this pod adapter you can specify the runtime property
`druid.indexer.runner.k8s.adapter.type: customTemplateAdapter`
-
-The base pod template must be specified as the runtime property
`druid.indexer.runner.k8s.podTemplate.base: /path/to/basePodSpec.yaml`
-Task specific pod templates must be specified as the runtime property
`druid.indexer.runner.k8s.podTemplate.{taskType}:
/path/to/taskSpecificPodSpec.yaml` where {taskType} is the name of the task
type i.e `index_parallel`
## Configuration
To use this extension please make sure to
[include](../../configuration/extensions.md#loading-extensions)`druid-kubernetes-overlord-extensions`
in the extensions load list for your overlord process.
-The extension uses the task queue to limit how many concurrent tasks (K8s
jobs) are in flight so it is required you have a reasonable value for
`druid.indexer.queue.maxSize`. Additionally set the variable
`druid.indexer.runner.namespace` to the namespace in which you are running
druid.
+The extension uses `druid.indexer.runner.capacity` to limit the number of k8s
jobs in flight. A good initial value for this would be the sum of the total
task slots of all the middle managers you were running before switching to K8s
based ingestion. The K8s task runner uses one thread per Job that is created,
so setting this number too large can cause memory issues on the overlord.
Additionally set the variable `druid.indexer.runner.namespace` to the namespace
in which you are running druid.
-Other configurations required are:
+Other configurations required are:
`druid.indexer.runner.type: k8s` and `druid.indexer.task.encapsulatedTask:
true`
-You can add optional labels to your K8s jobs / pods if you need them by using
the following configuration:
-`druid.indexer.runner.labels: '{"key":"value"}'`
-Annotations are the same with:
-`druid.indexer.runner.annotations: '{"key":"value"}'`
-
-All other configurations you had for the middle manager tasks must be moved
under the overlord with one caveat, you must specify javaOpts as an array:
-`druid.indexer.runner.javaOptsArray`, `druid.indexer.runner.javaOpts` is no
longer supported.
-
-If you are running without a middle manager you need to also use
`druid.processing.intermediaryData.storage.type=deepstore`
-
-Additional Configuration
+## Pod Adapters
+The logic defining how the pod template is built for your Kubernetes Job
depends on which pod adapter you have specified.
-### Properties
-|Property|Possible Values|Description|Default|required|
-|--------|---------------|-----------|-------|--------|
-|`druid.indexer.runner.debugJobs`|`boolean`|Clean up K8s jobs after tasks
complete.|False|No|
-|`druid.indexer.runner.sidecarSupport`|`boolean`|Deprecated, specify adapter
type as runtime property `druid.indexer.runner.k8s.adapter.type:
overlordMultiContainer` instead. If your overlord pod has sidecars, this will
attempt to start the task with the same sidecars as the overlord pod.|False|No|
-|`druid.indexer.runner.primaryContainerName`|`String`|If running with
sidecars, the `primaryContainerName` should be that of your druid container
like `druid-overlord`.|First container in `podSpec` list|No|
-|`druid.indexer.runner.kubexitImage`|`String`|Used kubexit project to help
shutdown sidecars when the main pod completes. Otherwise jobs with sidecars
never terminate.|karlkfi/kubexit:latest|No|
-|`druid.indexer.runner.disableClientProxy`|`boolean`|Use this if you have a
global http(s) proxy and you wish to bypass it.|false|No|
-|`druid.indexer.runner.maxTaskDuration`|`Duration`|Max time a task is allowed
to run for before getting killed|`PT4H`|No|
-|`druid.indexer.runner.taskCleanupDelay`|`Duration`|How long do jobs stay
around before getting reaped from K8s|`P2D`|No|
-|`druid.indexer.runner.taskCleanupInterval`|`Duration`|How often to check for
jobs to be reaped|`PT10M`|No|
-|`druid.indexer.runner.K8sjobLaunchTimeout`|`Duration`|How long to wait to
launch a K8s task before marking it as failed, on a resource constrained
cluster it may take some time.|`PT1H`|No|
-|`druid.indexer.runner.javaOptsArray`|`JsonArray`|java opts for the
task.|`-Xmx1g`|No|
-|`druid.indexer.runner.labels`|`JsonObject`|Additional labels you want to add
to peon pod|`{}`|No|
-|`druid.indexer.runner.annotations`|`JsonObject`|Additional annotations you
want to add to peon pod|`{}`|No|
-|`druid.indexer.runner.peonMonitors`|`JsonArray`|Overrides
`druid.monitoring.monitors`. Use this property if you don't want to inherit
monitors from the Overlord.|`[]`|No|
-|`druid.indexer.runner.graceTerminationPeriodSeconds`|`Long`|Number of seconds
you want to wait after a sigterm for container lifecycle hooks to complete.
Keep at a smaller value if you want tasks to hold locks for shorter
periods.|`PT30S` (K8s default)|No|
+### Overlord Single Container Pod Adapter/Overlord Multi Container Pod Adapter
+The overlord single container pod adapter takes the podSpec of your `Overlord`
pod and creates a kubernetes job from this podSpec. This is the default pod
adapter implementation, to explicitly enable it you can specify the runtime
property `druid.indexer.runner.k8s.adapter.type: overlordSingleContainer`
-### Gotchas
+The overlord multi container pod adapter takes the podSpec of your `Overlord`
pod and creates a kubernetes job from this podSpec. It uses kubexit to manage
dependency ordering between the main container that runs your druid peon and
other sidecars defined in the `Overlord` pod spec. Thus if you have sidecars
such as Splunk or Istio it will be able to handle them. To enable this pod
adapter you can specify the runtime property
`druid.indexer.runner.k8s.adapter.type: overlordMultiContainer`
-- You must have in your role the ability to launch jobs.
-- All Druid Pods belonging to one Druid cluster must be inside same kubernetes
namespace.
-- For the sidecar support to work, your entry point / command in docker must
be explicitly defined your spec.
+For the sidecar support to work for the multi container pod adapter, your
entry point / command in docker must be explicitly defined your spec.
-You can't have something like this:
-Dockerfile:
+You can't have something like this:
+Dockerfile:
``` ENTRYPOINT: ["foo.sh"] ```
-and in your sidecar specs:
+and in your sidecar specs:
``` container:
name: foo
args:
@@ -102,11 +62,11 @@ and in your sidecar specs:
- arg2
```
-That will not work, because we cannot decipher what your command is, the
extension needs to know it explicitly.
-**Even for sidecars like Istio which are dynamically created by the service
mesh, this needs to happen.*
+That will not work, because we cannot decipher what your command is, the
extension needs to know it explicitly.
+**Even for sidecars like Istio which are dynamically created by the service
mesh, this needs to happen.*
-Instead do the following:
-You can keep your Dockerfile the same but you must have a sidecar spec like
so:
+Instead do the following:
+You can keep your Dockerfile the same but you must have a sidecar spec like so:
``` container:
name: foo
command: foo.sh
@@ -115,33 +75,195 @@ You can keep your Dockerfile the same but you must have a
sidecar spec like so:
- arg2
```
-The following roles must also be accessible. An example spec could be:
+For both of these adapters, you can add optional labels to your K8s jobs /
pods if you need them by using the following configuration:
+`druid.indexer.runner.labels: '{"key":"value"}'`
+Annotations are the same with:
+`druid.indexer.runner.annotations: '{"key":"value"}'`
+All other configurations you had for the middle manager tasks must be moved
under the overlord with one caveat, you must specify javaOpts as an array:
+`druid.indexer.runner.javaOptsArray`, `druid.indexer.runner.javaOpts` is no
longer supported.
+
+If you are running without a middle manager you need to also use
`druid.processing.intermediaryData.storage.type=deepstore`
+
+### Custom Template Pod Adapter
+The custom template pod adapter allows you to specify a pod template file per
task type for more flexibility on how to define your pods. This adapter expects
a [Pod
Template](https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates) to
be available on the overlord's file system. This pod template is used as the
base of the pod spec for the Kubernetes Job. You can override things like
labels, environment variables, resources, annotation, or even the base image
with this template. [...]
+
+The base pod template must be specified as the runtime property
`druid.indexer.runner.k8s.podTemplate.base: /path/to/basePodSpec.yaml`
+
+Task specific pod templates can be specified as the runtime property
`druid.indexer.runner.k8s.podTemplate.{taskType}:
/path/to/taskSpecificPodSpec.yaml` where {taskType} is the name of the task
type i.e `index_parallel`
+
+The following is an example Pod Template that uses the regular druid docker
image.
+```
+apiVersion: "v1"
+kind: "PodTemplate"
+template:
+ metadata:
+ annotations:
+ sidecar.istio.io/proxyCPU: "512m" # to handle a injected istio sidecar
+ labels:
+ app.kubernetes.io/name: "druid-realtime-backend"
+ spec:
+ affinity: {}
+ containers:
+ - command:
+ - sh
+ - -c
+ - |
+ /peon.sh /druid/data 1
+ env:
+ - name: CUSTOM_ENV_VARIABLE
+ value: "hello"
+ image: apache/druid:{{DRUIDVERSION}}
+ name: main
+ ports:
+ - containerPort: 8091
+ name: druid-tls-port
+ protocol: TCP
+ - containerPort: 8100
+ name: druid-port
+ protocol: TCP
+ resources:
+ limits:
+ cpu: "1"
+ memory: 2400M
+ requests:
+ cpu: "1"
+ memory: 2400M
+ volumeMounts:
+ - mountPath: /opt/druid/conf/druid/cluster/master/coordinator-overlord #
runtime props are still mounted in this location because that's where peon.sh
looks for configs
+ name: nodetype-config-volume
+ readOnly: true
+ - mountPath: /druid/data
+ name: data-volume
+ - mountPath: /druid/deepstorage
+ name: deepstorage-volume
+ restartPolicy: "Never"
+ securityContext:
+ fsGroup: 1000
+ runAsGroup: 1000
+ runAsUser: 1000
+ tolerations:
+ - effect: NoExecute
+ key: node.kubernetes.io/not-ready
+ operator: Exists
+ tolerationSeconds: 300
+ - effect: NoExecute
+ key: node.kubernetes.io/unreachable
+ operator: Exists
+ tolerationSeconds: 300
+ volumes:
+ - configMap:
+ defaultMode: 420
+ name: druid-tiny-cluster-peons-config
+ name: nodetype-config-volume
+ - emptyDir: {}
+ name: data-volume
+ - emptyDir: {}
+ name: deepstorage-volume
+```
+
+The below runtime properties need to be passed to the Job's peon process.
+
+```
+druid.port=8100 (what port the peon should run on)
+druid.peon.mode=remote
+druid.service=druid/peon (for metrics reporting)
+druid.indexer.task.baseTaskDir=/druid/data (this should match the argument to
the ./peon.sh run command in the PodTemplate)
+druid.indexer.runner.type=k8s
+druid.indexer.task.encapsulatedTask=true
+```
+
+Any runtime property or JVM config used by the peon process can also be
passed. E.G. below is a example of a ConfigMap that can be used to generate the
`nodetype-config-volume` mount in the above template.
+```
+kind: ConfigMap
+metadata:
+ name: druid-tiny-cluster-peons-config
+ namespace: default
+apiVersion: v1
+data:
+ jvm.config: |-
+ -server
+ -XX:MaxDirectMemorySize=1000M
+ -Duser.timezone=UTC
+ -Dfile.encoding=UTF-8
+ -Dlog4j.debug
+ -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
+ -Djava.io.tmpdir=/druid/data
+ -Xmx1024M
+ -Xms1024M
+ log4j2.xml: |-
+ <?xml version="1.0" encoding="UTF-8" ?>
+ <Configuration status="WARN">
+ <Appenders>
+ <Console name="Console" target="SYSTEM_OUT">
+ <PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/>
+ </Console>
+ </Appenders>
+ <Loggers>
+ <Root level="info">
+ <AppenderRef ref="Console"/>
+ </Root>
+ </Loggers>
+ </Configuration>
+ runtime.properties: |
+ druid.port=8100
+ druid.service=druid/peon
+ druid.server.http.numThreads=5
+ druid.indexer.task.baseTaskDir=/druid/data
+ druid.indexer.runner.type=k8s
+ druid.peon.mode=remote
+ druid.indexer.task.encapsulatedTask=true
+```
+
+### Properties
+|Property| Possible Values | Description
|Default|required|
+|--------|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|--------|
+|`druid.indexer.runner.debugJobs`| `boolean` | Clean up K8s jobs after
tasks complete.
|False|No|
+|`druid.indexer.runner.sidecarSupport`| `boolean` | Deprecated, specify
adapter type as runtime property `druid.indexer.runner.k8s.adapter.type:
overlordMultiContainer` instead. If your overlord pod has sidecars, this will
attempt to start the task with the same sidecars as the overlord pod. |False|No|
+|`druid.indexer.runner.primaryContainerName`| `String` | If running
with sidecars, the `primaryContainerName` should be that of your druid
container like `druid-overlord`.
|First container in `podSpec` list|No|
+|`druid.indexer.runner.kubexitImage`| `String` | Used kubexit project
to help shutdown sidecars when the main pod completes. Otherwise jobs with
sidecars never terminate.
|karlkfi/kubexit:latest|No|
+|`druid.indexer.runner.disableClientProxy`| `boolean` | Use this if you
have a global http(s) proxy and you wish to bypass it.
|false|No|
+|`druid.indexer.runner.maxTaskDuration`| `Duration` | Max time a task is
allowed to run for before getting killed
|`PT4H`|No|
+|`druid.indexer.runner.taskCleanupDelay`| `Duration` | How long do jobs
stay around before getting reaped from K8s
|`P2D`|No|
+|`druid.indexer.runner.taskCleanupInterval`| `Duration` | How often to
check for jobs to be reaped
|`PT10M`|No|
+|`druid.indexer.runner.K8sjobLaunchTimeout`| `Duration` | How long to
wait to launch a K8s task before marking it as failed, on a resource
constrained cluster it may take some time.
|`PT1H`|No|
+|`druid.indexer.runner.javaOptsArray`| `JsonArray` | java opts for the
task.
|`-Xmx1g`|No|
+|`druid.indexer.runner.labels`| `JsonObject` | Additional labels you want
to add to peon pod
|`{}`|No|
+|`druid.indexer.runner.annotations`| `JsonObject` | Additional annotations
you want to add to peon pod
|`{}`|No|
+|`druid.indexer.runner.peonMonitors`| `JsonArray` | Overrides
`druid.monitoring.monitors`. Use this property if you don't want to inherit
monitors from the Overlord.
|`[]`|No|
+|`druid.indexer.runner.graceTerminationPeriodSeconds`| `Long` |
Number of seconds you want to wait after a sigterm for container lifecycle
hooks to complete. Keep at a smaller value if you want tasks to hold locks for
shorter periods.
|`PT30S` (K8s default)|No|
+|`druid.indexer.runner.capacity`| `Integer` | Number of concurrent jobs
that can be sent to Kubernetes.
|`2147483647`|No|
+
+### Gotchas
+
+- All Druid Pods belonging to one Druid cluster must be inside the same
Kubernetes namespace.
+
+- You must have a role binding for the overlord's service account that
provides the needed permissions for interacting with Kubernetes. An example
spec could be:
```
-apiVersion: rbac.authorization.k8s.io/v1
kind: Role
+apiVersion: rbac.authorization.k8s.io/v1
metadata:
- name: druid-cluster
+ namespace: <druid-namespace>
+ name: druid-k8s-task-scheduler
rules:
-- apiGroups:
- - ""
- - batch
- resources:
- - pods
- - configmaps
- - jobs
- verbs:
- - '*'
+ - apiGroups: ["batch"]
+ resources: ["jobs"]
+ verbs: ["get", "watch", "list", "delete", "create"]
+ - apiGroups: [""]
+ resources: ["pods", "pods/log"]
+ verbs: ["get", "watch", "list", "delete", "create"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
- name: druid-cluster
+ name: druid-k8s-binding
+ namespace: <druid-namespace>
subjects:
-- kind: ServiceAccount
- name: default
+ - kind: ServiceAccount
+ name: <druid-overlord-k8s-service-account>
+ namespace: <druid-namespace>
roleRef:
kind: Role
- name: druid-cluster
+ name: druid-k8s-task-scheduler
apiGroup: rbac.authorization.k8s.io
-```
+```
\ No newline at end of file
diff --git a/extensions-contrib/kubernetes-overlord-extensions/README.md
b/extensions-contrib/kubernetes-overlord-extensions/README.md
new file mode 100644
index 0000000000..7ebc704ecb
--- /dev/null
+++ b/extensions-contrib/kubernetes-overlord-extensions/README.md
@@ -0,0 +1,32 @@
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+druid-kubernetes-overlord-extensions
+=============
+
+Overview
+=============
+The Kubernetes Task Scheduling extension allows a Druid cluster running on
Kubernetes to schedule
+its tasks as Kubernetes Jobs instead of sending them to workers (middle
managers or indexers).
+
+Documentation
+=============
+More detailed documentation about how to configure and use the extension is
available [here](../../docs/development/extensions-contrib/k8s-jobs.md)
+
+
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]