benWize commented on a change in pull request #17233:
URL: https://github.com/apache/beam/pull/17233#discussion_r840779089
##########
File path: website/www/site/content/en/documentation/runners/spark.md
##########
@@ -240,6 +240,82 @@ See [here](/roadmap/portability/#sdk-harness-config) for details.
 See [here](/roadmap/portability/#sdk-harness-config) for details.
 {{< /paragraph >}}
+### Running on Dataproc cluster (YARN backed)
+
+To run Beam jobs written in Python, Go, and other supported languages, you can use the `SparkRunner` and `PortableRunner` as described on Beam's [Spark Runner](https://beam.apache.org/documentation/runners/spark/) page (also see the [Portability Framework Roadmap](https://beam.apache.org/roadmap/portability/)).
+
+The following example runs a portable Beam job in Python from the Dataproc cluster's master node, with YARN as the backend.
+
+> Note: This example executes successfully with Dataproc 2.0, Spark 2.4.8 and 3.1.2, and Beam 2.37.0.
+
+1. Create a Dataproc cluster with the [Docker](https://cloud.google.com/dataproc/docs/concepts/components/docker) component enabled.
+
+<pre>
+gcloud dataproc clusters create <b><i>CLUSTER_NAME</i></b> \
+ --optional-components=DOCKER \
+ --image-version=<b><i>DATAPROC_IMAGE_VERSION</i></b> \
+ --region=<b><i>REGION</i></b> \
+ --enable-component-gateway \
+ --scopes=https://www.googleapis.com/auth/cloud-platform \
+ --properties spark:spark.master.rest.enabled=true
+</pre>
+
+- `--optional-components`: Docker.
+- `--image-version`: the [cluster's image version](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions#supported_cloud_dataproc_versions), which determines the Spark version installed on the cluster (for example, see the Apache Spark component versions listed for the latest and previous four [2.0.x image release versions](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.0)).
+- `--region`: a supported Dataproc [region](https://cloud.google.com/dataproc/docs/concepts/regional-endpoints#regional_endpoint_semantics).
+- `--enable-component-gateway`: enable access to [web interfaces](https://cloud.google.com/dataproc/docs/concepts/accessing/dataproc-gateways).
+- `--scopes`: enable API access to GCP services in the same project.
+- `--properties`: add specific configuration for a component; here `spark.master.rest.enabled` is set to `true` so that jobs can later be submitted to the cluster (a quick verification sketch follows this list).
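+
+To confirm that the property was applied, you can inspect the cluster's software configuration. This is only a sketch: the `--format` projection and the `config.softwareConfig.properties` field path are assumptions about the Dataproc cluster resource layout, so verify them against the `gcloud` reference.
+
+<pre>
+# Print the software properties of the newly created cluster
+gcloud dataproc clusters describe <b><i>CLUSTER_NAME</i></b> \
+    --region=<b><i>REGION</i></b> \
+    --format='value(config.softwareConfig.properties)'
+</pre>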
+
+2. Create a Cloud Storage bucket.
+
+<pre>
+gsutil mb gs://<b><i>BUCKET_NAME</i></b>
+</pre>
+
+3. Install the necessary Python libraries for the job in your local environment.
+
+<pre>
+python -m pip install apache-beam[gcp]==<b><i>BEAM_VERSION</i></b>
+</pre>
+
+4. Bundle the word count example pipeline along with all dependencies, artifacts, etc. required to run the pipeline into a jar that can be executed later.
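+
+A minimal sketch of such a bundling command, assuming the Python wordcount example is used and the jar is written to <b><i>job.jar</i></b>; the exact flags used in the final guide may differ:
+
+<pre>
+# Build a self-contained jar instead of running the pipeline immediately
+python -m apache_beam.examples.wordcount \
+    --runner=SparkRunner \
+    --output_executable_path=job.jar \
+    --output=gs://<b><i>BUCKET_NAME</i></b>/python-wordcount-out \
+    --spark_version=3
+</pre>
+
+As a usage note, the resulting jar could then be submitted to the cluster, for example (again a sketch, not the authoritative command):
+
+<pre>
+gcloud dataproc jobs submit spark \
+    --cluster=<b><i>CLUSTER_NAME</i></b> \
+    --region=<b><i>REGION</i></b> \
+    --class=org.apache.beam.runners.spark.SparkPipelineRunner \
+    --jars=job.jar
+</pre>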
##########
 Review comment:
Thanks, added!