[beam] branch master updated: [Documentation] Update docs to run SparkPipelineRunner on a Kubernetes cluster (closes #27984)

mmack Fri, 01 Sep 2023 05:37:14 -0700

This is an automated email from the ASF dual-hosted git repository.

mmack pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git



The following commit(s) were added to refs/heads/master by this push:
     new 0b4302e5f95 [Documentation] Update docs to run SparkPipelineRunner on a Kubernetes cluster (closes #27984)
0b4302e5f95 is described below

commit 0b4302e5f95f2dc9b6658c13d5d1aa798cfba668
Author: Hao Xu <[email protected]>
AuthorDate: Fri Sep 1 05:33:12 2023 -0700

    [Documentation] Update docs to run SparkPipelineRunner on a Kubernetes cluster (closes #27984)
---
 .../site/content/en/documentation/runners/spark.md | 47 +++++++++++++++++++++-
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/website/www/site/content/en/documentation/runners/spark.md b/website/www/site/content/en/documentation/runners/spark.md
index dcc166873dc..29ef5c28102 100644
--- a/website/www/site/content/en/documentation/runners/spark.md
+++ b/website/www/site/content/en/documentation/runners/spark.md
@@ -487,5 +487,48 @@ Provided SparkContext and StreamingListeners are not supported on the Spark port
 {{< /paragraph >}}
 
 ### Kubernetes
-
-An [example](https://github.com/cometta/python-apache-beam-spark) of configuring Spark to run Apache beam job
+#### Submit beam job without job server
+To submit a Beam job directly to a Spark Kubernetes cluster without spinning up an extra job server, run:
+```
+spark-submit --master MASTER_URL \
+  --conf spark.kubernetes.driver.podTemplateFile=driver_pod_template.yaml \
+  --conf spark.kubernetes.executor.podTemplateFile=executor_pod_template.yaml \
+  --class org.apache.beam.runners.spark.SparkPipelineRunner \
+  --conf spark.kubernetes.container.image=apache/spark:v3.3.2 \
+  ./wc_job.jar
+```
+As with running a Beam job on Dataproc, you can bundle the job jar as shown below. The example uses the `PROCESS` type of [SDK harness](https://beam.apache.org/documentation/runtime/sdk-harness-config/) to execute the job in processes.
+```
+python -m beam_example_wc \
+    --runner=SparkRunner \
+    --output_executable_path=./wc_job.jar \
+    --environment_type=PROCESS \
+    --environment_config='{"command": "/opt/apache/beam/boot"}' \
+    --spark_version=3
+```
+
+Below is an example of a Kubernetes executor pod template. The `initContainer` is required to download the Beam SDK harness that runs the Beam pipelines.
+```
+spec:
+  containers:
+    - name: spark-kubernetes-executor
+      volumeMounts:
+      - name: beam-data
+        mountPath: /opt/apache/beam/
+  initContainers:
+  - name: init-beam
+    image: apache/beam_python3.7_sdk
+    command:
+    - cp
+    - /opt/apache/beam/boot
+    - /init-container/data/boot
+    volumeMounts:
+    - name: beam-data
+      mountPath: /init-container/data
+  volumes:
+  - name: beam-data
+    emptyDir: {}
+```
+
+#### Submit beam job with job server
+An [example](https://github.com/cometta/python-apache-beam-spark) of configuring Spark to run an Apache Beam job with a job server.
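The `--environment_config` value in the bundling command is a JSON object, and hand-escaping its quotes inside a shell string is easy to get wrong. The following is a minimal sketch, not part of the commit, that assembles the same flags in Python and lets `json.dumps` produce the JSON. The helper name `build_bundle_command` is hypothetical. The module name `beam_example_wc` and the paths are taken from the diff above.

```python
import json
import shlex


def build_bundle_command(module, jar_path, boot_path):
    """Return the argv list that bundles a Beam job jar with a PROCESS harness.

    Hypothetical helper: it only assembles the flags shown in the diff.
    """
    # json.dumps produces valid JSON, so no manual quote escaping is needed.
    environment_config = json.dumps({"command": boot_path})
    return [
        "python", "-m", module,
        "--runner=SparkRunner",
        f"--output_executable_path={jar_path}",
        "--environment_type=PROCESS",
        f"--environment_config={environment_config}",
        "--spark_version=3",
    ]


argv = build_bundle_command(
    "beam_example_wc", "./wc_job.jar", "/opt/apache/beam/boot"
)

# shlex.join (Python 3.8+) quotes each argument for safe pasting into a shell.
print(shlex.join(argv))
```

Passing `argv` as a list to `subprocess.run` would avoid shell quoting altogether. The printed form is only for copying into a terminal.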

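The executor pod template works because the init container and the executor mount the same `emptyDir` volume at different paths. The sketch below, which is illustrative and not part of the commit, copies the template's fields into a plain Python dict and traces where the file written by `init-beam` becomes visible in the executor. The function name `path_seen_by_executor` is hypothetical.

```python
import posixpath

# The executor pod template from the diff, written out as a plain dict.
POD_SPEC = {
    "containers": [
        {
            "name": "spark-kubernetes-executor",
            "volumeMounts": [
                {"name": "beam-data", "mountPath": "/opt/apache/beam/"},
            ],
        }
    ],
    "initContainers": [
        {
            "name": "init-beam",
            "image": "apache/beam_python3.7_sdk",
            "command": ["cp", "/opt/apache/beam/boot", "/init-container/data/boot"],
            "volumeMounts": [
                {"name": "beam-data", "mountPath": "/init-container/data"},
            ],
        }
    ],
    "volumes": [{"name": "beam-data", "emptyDir": {}}],
}


def path_seen_by_executor(spec, init_name, dest_path, executor_name):
    """Map a file written by an init container to its path in the executor.

    Returns None when the destination is not on a volume both containers mount.
    """
    init = next(c for c in spec["initContainers"] if c["name"] == init_name)
    executor = next(c for c in spec["containers"] if c["name"] == executor_name)
    for mount in init["volumeMounts"]:
        base = mount["mountPath"].rstrip("/")
        if not dest_path.startswith(base + "/"):
            continue
        # Path of the file relative to the root of the shared volume.
        relative = dest_path[len(base) + 1:]
        for target in executor["volumeMounts"]:
            if target["name"] == mount["name"]:
                return posixpath.join(target["mountPath"], relative)
    return None


# The last argument of the cp command is the copy destination.
copy_destination = POD_SPEC["initContainers"][0]["command"][-1]
print(path_seen_by_executor(
    POD_SPEC, "init-beam", copy_destination, "spark-kubernetes-executor"
))
# prints /opt/apache/beam/boot
```

The resulting path matches the `command` passed in `--environment_config`, which is why the two mount paths and the copy destination have to be changed together.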