benWize commented on a change in pull request #17233:
URL: https://github.com/apache/beam/pull/17233#discussion_r840779089
##########
File path: website/www/site/content/en/documentation/runners/spark.md
##########
@@ -240,6 +240,82 @@ See [here](/roadmap/portability/#sdk-harness-config) for details.
 See [here](/roadmap/portability/#sdk-harness-config) for details.
 {{< /paragraph >}}
+### Running on Dataproc cluster (YARN backed)
+
+To run Beam jobs written in Python, Go, and other supported languages, you can use the `SparkRunner` and `PortableRunner` as described on Beam's [Spark Runner](https://beam.apache.org/documentation/runners/spark/) page (also see the [Portability Framework Roadmap](https://beam.apache.org/roadmap/portability/)).
+
+The following example runs a portable Beam job in Python from the Dataproc cluster's master node, with YARN as the backend.
+
+> Note: This example executes successfully with Dataproc 2.0, Spark 2.4.8 and 3.1.2, and Beam 2.37.0.
+
+1. Create a Dataproc cluster with the [Docker](https://cloud.google.com/dataproc/docs/concepts/components/docker) component enabled.
+
+<pre>
+gcloud dataproc clusters create <b><i>CLUSTER_NAME</i></b> \
+ --optional-components=DOCKER \
+ --image-version=<b><i>DATAPROC_IMAGE_VERSION</i></b> \
+ --region=<b><i>REGION</i></b> \
+ --enable-component-gateway \
+ --scopes=https://www.googleapis.com/auth/cloud-platform \
+ --properties spark:spark.master.rest.enabled=true
+</pre>
+
+- `--optional-components`: Docker.
+- `--image-version`: the [cluster's image version](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions#supported_cloud_dataproc_versions), which determines the Spark version installed on the cluster (for example, see the Apache Spark component versions listed for the latest and previous four [2.0.x image release versions](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.0)).
+- `--region`: a supported Dataproc [region](https://cloud.google.com/dataproc/docs/concepts/regional-endpoints#regional_endpoint_semantics).
+- `--enable-component-gateway`: enable access to [web interfaces](https://cloud.google.com/dataproc/docs/concepts/accessing/dataproc-gateways).
+- `--scopes`: enable API access to GCP services in the same project.
+- `--properties`: add specific configuration for a component; here `spark.master.rest.enabled` is set to `true` so that jobs can later be submitted to the cluster (a quick verification sketch follows this list).
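+
+To confirm that the property was applied, you can inspect the cluster's software configuration. This is only a sketch: the `--format` projection and the `config.softwareConfig.properties` field path are assumptions about the Dataproc cluster resource layout, so verify them against the `gcloud` reference.
+
+<pre>
+# Print the software properties of the newly created cluster
+gcloud dataproc clusters describe <b><i>CLUSTER_NAME</i></b> \
+    --region=<b><i>REGION</i></b> \
+    --format='value(config.softwareConfig.properties)'
+</pre>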
+
+2. Create a Cloud Storage bucket.
+
+<pre>
+gsutil mb gs://<b><i>BUCKET_NAME</i></b>
+</pre>
+
+3. Install the necessary Python libraries for the job in your local environment.
+
+<pre>
+python -m pip install apache-beam[gcp]==<b><i>BEAM_VERSION</i></b>
+</pre>
+
+4. Bundle the word count example pipeline along with all dependencies, artifacts, etc. required to run the pipeline into a jar that can be executed later.
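+
+A minimal sketch of such a bundling command, assuming the Python wordcount example is used and the jar is written to <b><i>job.jar</i></b>; the exact flags used in the final guide may differ:
+
+<pre>
+# Build a self-contained jar instead of running the pipeline immediately
+python -m apache_beam.examples.wordcount \
+    --runner=SparkRunner \
+    --output_executable_path=job.jar \
+    --output=gs://<b><i>BUCKET_NAME</i></b>/python-wordcount-out \
+    --spark_version=3
+</pre>
+
+As a usage note, the resulting jar could then be submitted to the cluster, for example (again a sketch, not the authoritative command):
+
+<pre>
+gcloud dataproc jobs submit spark \
+    --cluster=<b><i>CLUSTER_NAME</i></b> \
+    --region=<b><i>REGION</i></b> \
+    --class=org.apache.beam.runners.spark.SparkPipelineRunner \
+    --jars=job.jar
+</pre>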
##########
 Review comment:
Thanks, added!