[GitHub] [beam] damccorm commented on a diff in pull request #23619: Adds a Java RunInference example

GitBox Fri, 14 Oct 2022 12:07:26 -0700


damccorm commented on code in PR #23619:
URL: https://github.com/apache/beam/pull/23619#discussion_r996051507



##########
examples/multi-language/README.md:
##########
@@ -22,29 +22,147 @@
 This project provides examples of Apache Beam
 [multi-language 
pipelines](https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines):
 
+## Using Java transforms from Python
+
 * **python/addprefix** - A Python pipeline that reads a text file and attaches 
a prefix on the Java side to each input.
 * **python/javacount** - A Python pipeline that counts words using the Java 
`Count.perElement()` transform.
 * **python/javadatagenerator** - A Python pipeline that produces a set of 
strings generated from Java.
                                   This example demonstrates the 
`JavaExternalTransform` API.
 
-## Instructions for running the pipelines
+### Instructions for running the pipelines
 
-### 1) Start the expansion service
+#### 1) Start the expansion service
 
 1. Download the latest 'beam-examples-multi-language' JAR. Starting with 
Apache Beam 2.36.0,
    you can find it in [the Maven Central 
Repository](https://search.maven.org/search?q=g:org.apache.beam).
 2. Run the following command, replacing `<version>` and `<port>` with valid 
values:
   `java -jar beam-examples-multi-language-<version>.jar <port> 
--javaClassLookupAllowlistFile='*'`
 
-### 2) Set up a Python virtual environment for Beam
+#### 2) Set up a Python virtual environment for Beam
 
 1. See [the Python 
quickstart](https://beam.apache.org/get-started/quickstart-py/)
    for more information.
 
-### 3) Execute the Python pipeline
+#### 3) Execute the Python pipeline
 
 1. In a new shell, run a pipeline in the **python** directory using a Beam 
runner that supports
    multi-language pipelines.
 
    The Python files contain details about the actual commands to run.
 
+## Using Python transforms from Java
+
+### Sklearn Mnist Classification
+
+Performs image classification on handwritten digits from the 
[MNIST](https://en.wikipedia.org/wiki/MNIST_database)
+database.
+
+Please see 
[here](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/inference)
 for
+context and information regarding the corresponding Python pipeline.
+
+Please note that the Java pipeline is
+[available in the Beam Java examples 
module](https://github.com/apache/beam/tree/master/examples/java/src/main/java/org/apache/beam/examples/multilanguage/SklearnMnistClassification.java).
+
+#### Setup
+
+* Obtain/generate a csv input file that contains labels and pixels to feed 
into the model and store it in
+GCS. And example input is available [here](TODO).

Review Comment:
   ```suggestion
   GCS. An example input is available [here](https://pantheon.corp.google.com/storage/browser/_details/apache-beam-samples/multi-language/mnist/example_input.csv;tab=live_object).
   ```
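For anyone who would rather generate an input file than use the sample, here is a minimal sketch. It assumes each CSV row is the label followed by 784 pixel values (one 28x28 image), which is not spelled out in the README text above. The helper names `make_rows` and `to_csv` are made up for illustration, and the values are random placeholders rather than real digits.

```python
import csv
import io
import random

# Assumption: one row = label + 28x28 grayscale pixels (0-255).
PIXELS_PER_IMAGE = 28 * 28


def make_rows(num_rows, seed=0):
    """Return rows of the form [label, pixel_0, ..., pixel_783] with placeholder values."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_rows):
        label = rng.randint(0, 9)
        pixels = [rng.randint(0, 255) for _ in range(PIXELS_PER_IMAGE)]
        rows.append([label] + pixels)
    return rows


def to_csv(rows):
    """Serialize the rows as comma-separated text, one image per line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


csv_text = to_csv(make_rows(3))
```

The resulting text could then be written to a local file and copied to GCS; real pixel data would be needed for meaningful predictions.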



##########
examples/multi-language/README.md:
##########
@@ -22,29 +22,147 @@
 This project provides examples of Apache Beam
 [multi-language 
pipelines](https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines):
 
+## Using Java transforms from Python
+
 * **python/addprefix** - A Python pipeline that reads a text file and attaches 
a prefix on the Java side to each input.
 * **python/javacount** - A Python pipeline that counts words using the Java 
`Count.perElement()` transform.
 * **python/javadatagenerator** - A Python pipeline that produces a set of 
strings generated from Java.
                                   This example demonstrates the 
`JavaExternalTransform` API.
 
-## Instructions for running the pipelines
+### Instructions for running the pipelines
 
-### 1) Start the expansion service
+#### 1) Start the expansion service
 
 1. Download the latest 'beam-examples-multi-language' JAR. Starting with 
Apache Beam 2.36.0,
    you can find it in [the Maven Central 
Repository](https://search.maven.org/search?q=g:org.apache.beam).
 2. Run the following command, replacing `<version>` and `<port>` with valid 
values:
   `java -jar beam-examples-multi-language-<version>.jar <port> 
--javaClassLookupAllowlistFile='*'`
 
-### 2) Set up a Python virtual environment for Beam
+#### 2) Set up a Python virtual environment for Beam
 
 1. See [the Python 
quickstart](https://beam.apache.org/get-started/quickstart-py/)
    for more information.
 
-### 3) Execute the Python pipeline
+#### 3) Execute the Python pipeline
 
 1. In a new shell, run a pipeline in the **python** directory using a Beam 
runner that supports
    multi-language pipelines.
 
    The Python files contain details about the actual commands to run.
 
+## Using Python transforms from Java
+
+### Sklearn Mnist Classification
+
+Performs image classification on handwritten digits from the 
[MNIST](https://en.wikipedia.org/wiki/MNIST_database)
+database.
+
+Please see 
[here](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/inference)
 for
+context and information regarding the corresponding Python pipeline.
+
+Please note that the Java pipeline is
+[available in the Beam Java examples 
module](https://github.com/apache/beam/tree/master/examples/java/src/main/java/org/apache/beam/examples/multilanguage/SklearnMnistClassification.java).
+
+#### Setup
+
+* Obtain/generate a csv input file that contains labels and pixels to feed 
into the model and store it in
+GCS. And example input is available [here](TODO).
+
+* Create a model file that contains the pickled file of a scikit-learn model
+trained on MNIST data and store it in GCS. An example model file is available 
[here](TODO).

Review Comment:
   ```suggestion
   trained on MNIST data and store it in GCS. An example model file is available [here](https://pantheon.corp.google.com/storage/browser/_details/apache-beam-samples/multi-language/mnist/example_model;tab=live_object).
   ```



##########
examples/multi-language/README.md:
##########
@@ -22,29 +22,147 @@
 This project provides examples of Apache Beam
 [multi-language 
pipelines](https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines):
 
+## Using Java transforms from Python
+
 * **python/addprefix** - A Python pipeline that reads a text file and attaches 
a prefix on the Java side to each input.
 * **python/javacount** - A Python pipeline that counts words using the Java 
`Count.perElement()` transform.
 * **python/javadatagenerator** - A Python pipeline that produces a set of 
strings generated from Java.
                                   This example demonstrates the 
`JavaExternalTransform` API.
 
-## Instructions for running the pipelines
+### Instructions for running the pipelines
 
-### 1) Start the expansion service
+#### 1) Start the expansion service
 
 1. Download the latest 'beam-examples-multi-language' JAR. Starting with 
Apache Beam 2.36.0,
    you can find it in [the Maven Central 
Repository](https://search.maven.org/search?q=g:org.apache.beam).
 2. Run the following command, replacing `<version>` and `<port>` with valid 
values:
   `java -jar beam-examples-multi-language-<version>.jar <port> 
--javaClassLookupAllowlistFile='*'`
 
-### 2) Set up a Python virtual environment for Beam
+#### 2) Set up a Python virtual environment for Beam
 
 1. See [the Python 
quickstart](https://beam.apache.org/get-started/quickstart-py/)
    for more information.
 
-### 3) Execute the Python pipeline
+#### 3) Execute the Python pipeline
 
 1. In a new shell, run a pipeline in the **python** directory using a Beam 
runner that supports
    multi-language pipelines.
 
    The Python files contain details about the actual commands to run.
 
+## Using Python transforms from Java
+
+### Sklearn Mnist Classification
+
+Performs image classification on handwritten digits from the 
[MNIST](https://en.wikipedia.org/wiki/MNIST_database)
+database.
+
+Please see 
[here](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/inference)
 for
+context and information regarding the corresponding Python pipeline.
+
+Please note that the Java pipeline is
+[available in the Beam Java examples 
module](https://github.com/apache/beam/tree/master/examples/java/src/main/java/org/apache/beam/examples/multilanguage/SklearnMnistClassification.java).
+
+#### Setup
+
+* Obtain/generate a csv input file that contains labels and pixels to feed 
into the model and store it in
+GCS. And example input is available [here](TODO).
+
+* Create a model file that contains the pickled file of a scikit-learn model
+trained on MNIST data and store it in GCS. An example model file is available 
[here](TODO).

Review Comment:
   Also, it would be good to call out how this model was trained (or whether it was pulled from a hub).



##########
examples/multi-language/README.md:
##########
@@ -22,29 +22,147 @@
 This project provides examples of Apache Beam
 [multi-language 
pipelines](https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines):
 
+## Using Java transforms from Python
+
 * **python/addprefix** - A Python pipeline that reads a text file and attaches 
a prefix on the Java side to each input.
 * **python/javacount** - A Python pipeline that counts words using the Java 
`Count.perElement()` transform.
 * **python/javadatagenerator** - A Python pipeline that produces a set of 
strings generated from Java.
                                   This example demonstrates the 
`JavaExternalTransform` API.
 
-## Instructions for running the pipelines
+### Instructions for running the pipelines
 
-### 1) Start the expansion service
+#### 1) Start the expansion service
 
 1. Download the latest 'beam-examples-multi-language' JAR. Starting with 
Apache Beam 2.36.0,
    you can find it in [the Maven Central 
Repository](https://search.maven.org/search?q=g:org.apache.beam).
 2. Run the following command, replacing `<version>` and `<port>` with valid 
values:
   `java -jar beam-examples-multi-language-<version>.jar <port> 
--javaClassLookupAllowlistFile='*'`
 
-### 2) Set up a Python virtual environment for Beam
+#### 2) Set up a Python virtual environment for Beam
 
 1. See [the Python 
quickstart](https://beam.apache.org/get-started/quickstart-py/)
    for more information.
 
-### 3) Execute the Python pipeline
+#### 3) Execute the Python pipeline
 
 1. In a new shell, run a pipeline in the **python** directory using a Beam 
runner that supports
    multi-language pipelines.
 
    The Python files contain details about the actual commands to run.
 
+## Using Python transforms from Java
+
+### Sklearn Mnist Classification
+
+Performs image classification on handwritten digits from the 
[MNIST](https://en.wikipedia.org/wiki/MNIST_database)
+database.
+
+Please see 
[here](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/inference)
 for
+context and information regarding the corresponding Python pipeline.
+
+Please note that the Java pipeline is
+[available in the Beam Java examples 
module](https://github.com/apache/beam/tree/master/examples/java/src/main/java/org/apache/beam/examples/multilanguage/SklearnMnistClassification.java).
+
+#### Setup
+
+* Obtain/generate a csv input file that contains labels and pixels to feed 
into the model and store it in
+GCS. And example input is available [here](TODO).
+
+* Create a model file that contains the pickled file of a scikit-learn model
+trained on MNIST data and store it in GCS. An example model file is available 
[here](TODO).
+
+* Perform Beam runner specific setup according to instructions
+[here](https://beam.apache.org/get-started/quickstart-java/#run-a-pipeline).
+
+The following instructions are for running the pipeline with the Dataflow runner. For other portable runners,
+please modify the instructions according to the guidelines
+[here](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/#run-with-directrunner).
+
+#### Instructions for running the Java pipeline on released Beam (Beam 2.43.0 and later).
+
+* Check out the Beam examples Maven archetype for the relevant Beam version.
+
+```
+export BEAM_VERSION=<Beam version>
+
+mvn archetype:generate \
+    -DarchetypeGroupId=org.apache.beam \
+    -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
+    -DarchetypeVersion=$BEAM_VERSION \
+    -DgroupId=org.example \
+    -DartifactId=multi-language-beam \
+    -Dversion="0.1" \
+    -Dpackage=org.apache.beam.examples \
+    -DinteractiveMode=false
+```
+
+* Run the pipeline.
+
+```
+export GCP_PROJECT=<GCP project>
+export GCP_BUCKET=<GCP bucket>
+export GCP_REGION=<GCP region>
+
+mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.multilanguage.SklearnMnistClassification \
+    -Dexec.args="--runner=DataflowRunner --project=$GCP_PROJECT \
+                 --region=$GCP_REGION \
+                 --gcpTempLocation=gs://$GCP_BUCKET/multi-language-beam/tmp \
+                 --output=gs://$GCP_BUCKET/multi-language-beam/output" \
+    -Pdataflow-runner
+```
+
+* Inspect the output. Each line has data separated by a comma ",". The first 
item is the actual label of
+the digit. The second item is the predicted label of the digit.
+
+```
+gsutil cat gs://$GCP_BUCKET/multi-language-beam/output*
+```
+
+#### Instructions for running the Java pipeline at HEAD (Beam 2.41.0 and 2.42.0).
+
+* Make sure that Docker is installed and available on your system.
+
+* Build and push Python and Java Docker containers.
+
+```
+export DOCKER_ROOT=<Docker root>
+
+./gradlew :sdks:python:container:py38:docker -Pdocker-repository-root=$DOCKER_ROOT -Pdocker-tag=latest
+
+docker push $DOCKER_ROOT/beam_python3.8_sdk:latest
+
+./gradlew :sdks:java:container:java11:docker -Pdocker-repository-root=$DOCKER_ROOT -Pdocker-tag=latest
+
+docker push $DOCKER_ROOT/beam_java11_sdk:latest
+```
+
+* Run the pipeline using the following Gradle command (this guide assumes 
Dataflow runner).
+Note that we override both the Java and Python SDK harness containers here.
+
+```
+export GCP_PROJECT=<GCP project>
+export GCP_BUCKET=<GCP bucket>
+export GCP_REGION=<GCP region>
+
+./gradlew :examples:multi-language:sklearnMinstClassification --args=" \
+--runner=DataflowRunner \
+--project=$GCP_PROJECT \
+--gcpTempLocation=gs://$GCP_BUCKET/multi-language-beam/tmp \
+--output=gs://$GCP_BUCKET/multi-language-beam/output \
+--sdkContainerImage=$DOCKER_ROOT/beam_java11_sdk:latest \
+--sdkHarnessContainerImageOverrides=.*python.*,$DOCKER_ROOT/beam_python3.8_sdk:latest \
+--region=${GCP_REGION}"
+```
+
+* Inspect the output. Each line has data separated by a comma ",". The first 
item is the actual label
+of the digit. The second item is the predicted label of the digit.
+
+```
+gsutil cat gs://$GCP_BUCKET/multi-language-beam/output*
+```
+
+
+
+
+
+

Review Comment:
   ```suggestion
   ```
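On the "Inspect the output" steps in this hunk: since each line is described as `actual,predicted`, a quick sanity check of the results might look like the sketch below. The `accuracy` helper and the sample lines are hypothetical and not part of the PR; it only assumes the two-field, comma-separated layout stated in the README text.

```python
def accuracy(lines):
    """Return the fraction of lines whose actual label equals the predicted label."""
    total = 0
    correct = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # First field: actual label. Second field: predicted label.
        actual, predicted = (field.strip() for field in line.split(",", 1))
        total += 1
        if actual == predicted:
            correct += 1
    return correct / total if total else 0.0


# Made-up sample lines in the documented "actual,predicted" format.
sample = ["7,7", "2,2", "1,7", "0,0"]
print(accuracy(sample))
```

The same function could be applied to the lines printed by the `gsutil cat` command shown in the README.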



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
