alxp1982 commented on code in PR #26227: URL: https://github.com/apache/beam/pull/26227#discussion_r1183144757
########## learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi pipeline
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. With multi-language support, you can use a combination of supported languages in a single pipeline. This makes it possible to use the language that best suits your needs for each part of the pipeline.
+
+To build a multi-language pipeline with Apache Beam, you can use the following approach:
+
+* Define your pipeline using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed.
+* Use Apache Beam's language-specific SDKs to implement the data processing steps in the appropriate programming languages. For example, you could use Java to process some data, Python to process some other data, and Go to perform a specific computation.
+* Use Apache Beam's cross-language support to connect the different parts of your pipeline. Apache Beam provides a standard data model and serialization format so that data can be passed seamlessly between different languages.
+
+Using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages while building a unified data processing pipeline. This can be especially useful when working with large datasets and re-using existing components.
+
+To create a multi-language pipeline in Apache Beam, follow these steps:
+
+* **Choose your SDKs**: First, decide which programming languages and corresponding SDKs you'd like to use. Apache Beam currently supports Python, Java, and Go SDKs.
+* **Set up the dependencies**: Make sure you have installed the necessary dependencies for each language. For instance, you'll need the Beam Python SDK for Python or the Beam Java SDK for Java.
+* **Create a pipeline**: Using the primary language of your choice, create a pipeline object using the respective SDK. This pipeline will serve as the main entry point for your multi-language pipeline.
+* **Use cross-language transforms**: To execute transforms written in other languages, use the `ExternalTransform` class (in Python) or the `External` class (in Java). This allows you to use a transform written in another language as if it were a native transform in your main pipeline. You'll need to provide the appropriate expansion service address for the language of the transform.
+
+{{if (eq .Sdk "java")}}
+
+#### Start an expansion service
+
+When building a job for a multi-language pipeline, Beam uses an expansion service to expand composite transforms. Therefore, you must have at least one expansion service per remote SDK.
+
+In most cases, if you have a supported version of Python installed on your system, you can let `PythonExternalTransform` handle the details of creating and starting up the expansion service. But if you want to customize the environment or use transforms not available in the default Beam SDK, you might need to run your own expansion service.
+
+For example, to start the standard Python expansion service, `ExpansionServiceServicer`, follow these steps:
+
+* Activate a new virtual environment following these instructions.
+
+* Install Apache Beam with the gcp and dataframe packages:
+```
+pip install 'apache-beam[gcp,dataframe]'
+```
+* Run the following command:
+```
+python -m apache_beam.runners.portability.expansion_service_main -p <PORT> --fully_qualified_name_glob "*"
+```
+
+The command runs `expansion_service_main.py`, which starts the standard expansion service. When you use Gradle to run your Java pipeline, you can specify the expansion service with the `expansionService` option. For example: `--expansionService=localhost:<PORT>`.
+
+#### Run the Java program
+
+```
+private String getModelLoaderScript() {
+  String s = "from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy\n";
+  s = s + "from apache_beam.ml.inference.base import KeyedModelHandler\n";
+  s = s + "def get_model_handler(model_uri):\n";
+  s = s + "  return KeyedModelHandler(SklearnModelHandlerNumpy(model_uri))\n";
+
+  return s;
+}
+
+void runExample(SklearnMnistClassificationOptions options, String expansionService) {
+  // Schema of the output PCollection Row type to be provided to the RunInference transform.
+  Schema schema =
+      Schema.of(
+          Schema.Field.of("example", Schema.FieldType.array(Schema.FieldType.INT64)),
+          Schema.Field.of("inference", Schema.FieldType.STRING));
+
+  Pipeline pipeline = Pipeline.create(options);
+  PCollection<KV<Long, Iterable<Long>>> col =
+      pipeline
+          .apply(TextIO.read().from(options.getInput()))
+          .apply(Filter.by(new FilterNonRecordsFn()))
+          .apply(MapElements.via(new RecordsToLabeledPixelsFn()));
+
+  col.apply(RunInference.ofKVs(getModelLoaderScript(), schema, VarLongCoder.of())
+          .withKwarg("model_uri", options.getModelPath())
+          .withExpansionService(expansionService))
+      .apply(MapElements.via(new FormatOutput()))
+      .apply(TextIO.write().to("out.txt"));
+
+  pipeline.run().waitUntilFinish();
+}
+```
+
+> **Note**. The Python expansion service may not work on Windows. The **expansion service** and the **Java pipeline** must run in the same environment. If the **expansion service** is running in a **Docker** container, you should run the **Java pipeline** inside that container as well.
+
+#### Run in Docker
+
+Run the **expansion service** and the **Java program**:
+```
+# Use an official Java base image
+FROM openjdk:11
+
+# Install Python 3 (the python/python-pip packages are not available on newer Debian bases)
+RUN apt-get update && \
+    apt-get install -y python3 python3-pip
+
+# Set the working directory
+WORKDIR /app
+
+# Copy Java and Python source files to the working directory
+COPY src /app/src
+COPY scripts /app/scripts
+
+# Install Python dependencies (each RUN must be a complete command, so the
+# packages are chained with && instead of stray backslash/RUN pairs)
+RUN pip3 install --upgrade pip && \
+    pip3 install apache-beam==2.46.0 && \
+    pip3 install --default-timeout=100 torch torchvision pandas scikit-learn
+
+# SPRING_VERSION must be supplied at build time, e.g. --build-arg SPRING_VERSION=<version>
+ARG SPRING_VERSION
+RUN mkdir -p /opt/apache/beam/jars && \
+    wget https://repo1.maven.org/maven2/org/springframework/spring-expression/$SPRING_VERSION/spring-expression-$SPRING_VERSION.jar && \
+    mv spring-expression-$SPRING_VERSION.jar /opt/apache/beam/jars/spring-expression.jar
+
+# Compile the Java program
+RUN javac -d /app/bin /app/src/MyJavaProgram.java
+
+# Start the expansion service in the background, give it a moment to come up,
+# then run the Java program (the pipeline needs the service to be reachable)
+ENTRYPOINT ["bash", "-c", "python3 -m apache_beam.runners.portability.expansion_service_main -p 9090 --fully_qualified_name_glob=* & sleep 5 && java -cp /app/bin MyJavaProgram"]
+```
+{{end}}
+
+
+{{if (eq .Sdk "python")}}
+
+
+Write your own Java expansion service:
+```
+import com.google.auto.service.AutoService;
+import com.google.common.collect.ImmutableMap;
+import org.apache.beam.sdk.expansion.ExternalTransformRegistrar;
+import org.apache.beam.sdk.transforms.Count;
+import org.apache.beam.sdk.transforms.ExternalTransformBuilder;
+import org.apache.beam.sdk.transforms.PTransform;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+
+import java.util.Map;
+
+public class Task {
+  static class JavaCount extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
+
+    public JavaCount() {
+    }
+
+    @Override
+    public PCollection<KV<String, Long>> expand(PCollection<String> input) {
+      return input
+          .apply(
+              "JavaCount", Count.perElement());
+    }
+  }
+
+  static class JavaCountBuilder implements
+      ExternalTransformBuilder<JavaCountConfiguration, PCollection<String>, PCollection<KV<String, Long>>> {
+
+    @Override
+    public PTransform<PCollection<String>, PCollection<KV<String, Long>>> buildExternal(
+        JavaCountConfiguration configuration) {
+      return new JavaCount();
+    }
+  }
+
+  static class JavaCountConfiguration {
+
+  }
+
+  // The registrar must be a static class so that @AutoService can instantiate it.
+  @AutoService(ExternalTransformRegistrar.class)
+  public static class JavaCountRegistrar implements ExternalTransformRegistrar {
+
+    static final String URN = "my.beam.transform.javacount";
+
+    @Override
+    public Map<String, ExternalTransformBuilder<?, ?, ?>> knownBuilderInstances() {
+      return ImmutableMap.of(URN, new JavaCountBuilder());
+    }
+  }
+}
+```
+
+Build the jar and run it as an expansion service on port 12345:
+```
+mvn clean
+mvn package -DskipTests
+cd target
+java -jar java-count-bundled-0.1.jar 12345
+```
+
+Python pipeline:
+```
+import logging
+import re
+
+import apache_beam as beam
+from apache_beam.io import ReadFromText
+from apache_beam.io import WriteToText
+from apache_beam.options.pipeline_options import PipelineOptions
+
+
+class WordExtractingDoFn(beam.DoFn):
+  """Parse each line of input text into words."""
+  def process(self, element):
+    """Returns an iterator over the words of this element.
+    The element is a line of text.
+    Args:
+      element: the element being processed
+    Returns:
+      The processed element.
+    """
+    return re.findall(r"[\w\']+", element, re.UNICODE)
+
+
+def run(input_path, output_path, pipeline_args):
+  pipeline_options = PipelineOptions(pipeline_args)
+
+  with beam.Pipeline(options=pipeline_options) as p:
+    lines = p | 'Read' >> ReadFromText(input_path).with_output_types(str)
+    words = lines | 'Split' >> (beam.ParDo(WordExtractingDoFn()).with_output_types(str))
+
+    java_output = (
+        words
+        | 'JavaCount' >> beam.ExternalTransform(
+            'my.beam.transform.javacount',
+            None,
+            "localhost:12345"))
+
+    def format(kv):
+      key, value = kv
+      return '%s:%s' % (key, value)
+
+    output = java_output | 'Format' >> beam.Map(format)
+    output | 'Write' >> WriteToText(output_path)
+
+
+if __name__ == '__main__':
+  logging.getLogger().setLevel(logging.INFO)
+  import argparse
+
+  parser = argparse.ArgumentParser()
+  parser.add_argument(
+      '--input',
+      dest='input',
+      required=True,
+      help='Input file')
+  parser.add_argument(
+      '--output',
+      dest='output',
+      required=True,
+      help='Output file')
+  known_args, pipeline_args = parser.parse_known_args()
+
+  run(
+      known_args.input,
+      known_args.output,
+      pipeline_args)
+```
+
+Run the program:
+```
+python javacount.py --runner DirectRunner --environment_type=DOCKER --input input1 --output output --sdk_harness_container_image_overrides ".*java.*,chamikaramj/beam_java11_sdk:latest"
+```
+{{end}}
Review Comment: Runnable example and challenge is missing

########## learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
########## @@ -0,0 +1,280 @@ +<!-- +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +--> + +### Multi pipeline + +Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly. + +Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline. Review Comment: Apache Beam supports multiple programming languages, including Java, Python, and Go. With multi-language support, you can use a combination of supported languages in a single pipeline. This makes it possible to use the language that best suits your needs for each part of the pipeline. ########## learning/tour-of-beam/learning-content/cross-language/sql-transform/description.md: ########## @@ -0,0 +1,78 @@ +<!-- +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Sql transform
+
+In Apache Beam, you can use **SQL** transforms to query and manipulate your data within the pipeline using SQL-like syntax. This is particularly useful if you're already familiar with SQL and want to leverage that knowledge while processing data in your Apache Beam pipeline.
+
+Apache Beam **SQL** is built on top of Calcite, an open-source SQL parser and optimizer framework.
+
+To use **SQL** transforms in Apache Beam, you'll need to perform the following steps:
+
+Create a `PCollection` of rows. In Apache Beam, **SQL** operations are performed on `PCollection<Row>` objects. You'll need to convert your data into a `PCollection` of rows, which requires defining a schema that describes the structure of your data.
Review Comment: Create a `PCollection` of rows. Apache Beam performs **SQL** operations on `PCollection<Row>` objects. You'll need to convert your data into a `PCollection` of rows, which requires defining a schema that describes the structure of your data.

########## learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi pipeline
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines.
With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly. + +Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline. + +To build a multi-language pipeline with Apache Beam, you can use the following approach: + +Define your pipeline using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed. Review Comment: Should be a list? ########## learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md: ########## @@ -0,0 +1,280 @@ +<!-- +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +--> + +### Multi pipeline + +Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly. + +Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline. 
+ +To build a multi-language pipeline with Apache Beam, you can use the following approach: + +Define your pipeline using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed. + +Use Apache Beam's language-specific SDKs to implement the data processing steps in the appropriate programming languages. For example, you could use Java to process some data, Python to process some other data, and Go to perform a specific computation. + +Use Apache Beam's cross-language support to connect the different parts of your pipeline together. Apache Beam provides a common data model and serialization format, so data can be passed seamlessly between different languages. + +By using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages, while still building a unified data processing pipeline. This can be especially useful when working with large datasets, as different languages may have different performance characteristics for different tasks. + +To create a multi-language pipeline in Apache Beam, follow these steps: + +Choose your SDKs: First, decide which programming languages and corresponding SDKs you'd like to use. Apache Beam currently supports Python, Java, and Go SDKs. + +Set up the dependencies: Make sure you have installed the necessary dependencies for each language. For instance, you'll need the Beam Python SDK for Python or the Beam Java SDK for Java. + +Create a pipeline: Using the primary language of your choice, create a pipeline object using the respective SDK. This pipeline will serve as the main entry point for your multi-language pipeline. + +Use cross-language transforms: To execute transforms written in other languages, use the ExternalTransform class (in Python) or the External class (in Java). This allows you to use a transform written in another language as if it were a native transform in your main pipeline. 
You'll need to provide the appropriate expansion service address for the language of the transform. + +{{if (eq .Sdk "java")}} + +#### Start an expansion service + +When building a job for a multi-language pipeline, Beam uses an expansion service to expand composite transforms. You must have at least one expansion service per remote SDK. Review Comment: When building a job for a multi-language pipeline, Beam uses an expansion service to expand composite transforms. Therefore, you must have at least one expansion service per remote SDK. ########## learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md: ########## @@ -0,0 +1,280 @@ +<!-- +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +--> + +### Multi pipeline + +Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly. + +Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline. + +To build a multi-language pipeline with Apache Beam, you can use the following approach: + +Define your pipeline using Apache Beam's SDK in your preferred programming language. 
This defines the data processing steps that need to be executed. + +Use Apache Beam's language-specific SDKs to implement the data processing steps in the appropriate programming languages. For example, you could use Java to process some data, Python to process some other data, and Go to perform a specific computation. + +Use Apache Beam's cross-language support to connect the different parts of your pipeline together. Apache Beam provides a common data model and serialization format, so data can be passed seamlessly between different languages. + +By using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages, while still building a unified data processing pipeline. This can be especially useful when working with large datasets, as different languages may have different performance characteristics for different tasks. + +To create a multi-language pipeline in Apache Beam, follow these steps: + +Choose your SDKs: First, decide which programming languages and corresponding SDKs you'd like to use. Apache Beam currently supports Python, Java, and Go SDKs. Review Comment: Should be a list in the markdown outlining each task? ########## learning/tour-of-beam/learning-content/cross-language/sql-transform/description.md: ########## @@ -0,0 +1,78 @@ +<!-- +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+--> + +### Sql transform + +In Apache Beam, you can use **SQL** transforms to query and manipulate your data within the pipeline using SQL-like syntax. This is particularly useful if you're already familiar with SQL and want to leverage that knowledge while processing data in your Apache Beam pipeline. + +Apache Beam **SQL** is built on top of Calcite, an open-source SQL parser, and optimizer framework. + +To use **SQL** transforms in Apache Beam, you'll need to perform the following steps: + +Create a `PCollection` of rows. In Apache Beam, **SQL** operations are performed on `PCollection<Row>` objects. You'll need to convert your data into a `PCollection` of rows, which requires defining a schema that describes the structure of your data. Review Comment: List? ########## learning/tour-of-beam/learning-content/cross-language/sql-transform/description.md: ########## @@ -0,0 +1,78 @@ +<!-- +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +--> + +### Sql transform + +In Apache Beam, you can use **SQL** transforms to query and manipulate your data within the pipeline using SQL-like syntax. This is particularly useful if you're already familiar with SQL and want to leverage that knowledge while processing data in your Apache Beam pipeline. + +Apache Beam **SQL** is built on top of Calcite, an open-source SQL parser, and optimizer framework. + +To use **SQL** transforms in Apache Beam, you'll need to perform the following steps: + +Create a `PCollection` of rows. 
In Apache Beam, **SQL** operations are performed on `PCollection<Row>` objects. You'll need to convert your data into a `PCollection` of rows, which requires defining a schema that describes the structure of your data.
+
+Apply the SQL transform. Once you have your `PCollection<Row>`, you can apply a SQL transform using the `SqlTransform.query()` method. You'll need to provide the SQL query you want to execute on your data.
+
+{{if (eq .Sdk "java")}}
+Add the necessary dependencies to your project. For Java, you'll need to include the `beam-sdks-java-extensions-sql` dependency in your build configuration.
+
+```
+import org.apache.beam.sdk.extensions.sql.SqlTransform;
+import org.apache.beam.sdk.schemas.Schema;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.Row;
+
+// Define the schema for your data
+Schema schema = Schema.builder()
+    .addField("id", Schema.FieldType.INT32)
+    .addField("name", Schema.FieldType.STRING)
+    .build();
+
+// Assume input is a PCollection<Row> with the above schema
+PCollection<Row> input = ...;
+
+// Apply the SQL transform
+PCollection<Row> result = input.apply(
+    SqlTransform.query("SELECT id, name FROM PCOLLECTION WHERE id > 100"));
+```
+{{end}}
+
+{{if (eq .Sdk "python")}}
+```
+import apache_beam as beam
+from apache_beam.transforms.sql import SqlTransform
+
+# Define a sample input data as a list of dictionaries
+input_data = [
+    {'id': 1, 'name': 'Alice'},
+    {'id': 2, 'name': 'Bob'},
+    {'id': 101, 'name': 'Carol'},
+    {'id': 102, 'name': 'David'},
+]
+
+# Create a pipeline
+with beam.Pipeline() as p:
+  # Read input data and convert it to a PCollection of Rows
+  input_rows = (
+      p | 'Create input data' >> beam.Create(input_data)
+      | 'Convert to Rows' >> beam.Map(lambda x: beam.Row(id=int(x['id']), name=str(x['name'])))
+  )
+
+  # Apply the SQL transform
+  filtered_rows = input_rows | SqlTransform("SELECT id, name FROM PCOLLECTION WHERE id > 100")
+
+  # Print the results
+  filtered_rows | 'Print results' >> beam.Map(print)
+```
+{{end}}
Review Comment: Runnable example description and challenge?

########## learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi pipeline
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline.
+
+To build a multi-language pipeline with Apache Beam, you can use the following approach:
+
+Define your pipeline using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed.
+
+Use Apache Beam's language-specific SDKs to implement the data processing steps in the appropriate programming languages. For example, you could use Java to process some data, Python to process some other data, and Go to perform a specific computation.
+
+Use Apache Beam's cross-language support to connect the different parts of your pipeline together.
Apache Beam provides a common data model and serialization format, so data can be passed seamlessly between different languages. + +By using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages, while still building a unified data processing pipeline. This can be especially useful when working with large datasets, as different languages may have different performance characteristics for different tasks. + +To create a multi-language pipeline in Apache Beam, follow these steps: + +Choose your SDKs: First, decide which programming languages and corresponding SDKs you'd like to use. Apache Beam currently supports Python, Java, and Go SDKs. + +Set up the dependencies: Make sure you have installed the necessary dependencies for each language. For instance, you'll need the Beam Python SDK for Python or the Beam Java SDK for Java. + +Create a pipeline: Using the primary language of your choice, create a pipeline object using the respective SDK. This pipeline will serve as the main entry point for your multi-language pipeline. + +Use cross-language transforms: To execute transforms written in other languages, use the ExternalTransform class (in Python) or the External class (in Java). This allows you to use a transform written in another language as if it were a native transform in your main pipeline. You'll need to provide the appropriate expansion service address for the language of the transform. + +{{if (eq .Sdk "java")}} + +#### Start an expansion service + +When building a job for a multi-language pipeline, Beam uses an expansion service to expand composite transforms. You must have at least one expansion service per remote SDK. + +In the general case, if you have a supported version of Python installed on your system, you can let `PythonExternalTransform` handle the details of creating and starting up the expansion service. 
But if you want to customize the environment or use transforms not available in the default Beam SDK, you might need to run your own expansion service. Review Comment: In most cases, if you have a supported version of Python installed on your system, you can let `PythonExternalTransform` handle the details of creating and starting up the expansion service. But if you want to customize the environment or use transforms not available in the default Beam SDK, you might need to run your own expansion service. ########## learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md: ########## @@ -0,0 +1,280 @@ +<!-- +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +--> + +### Multi pipeline + +Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly. + +Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline. + +To build a multi-language pipeline with Apache Beam, you can use the following approach: + +Define your pipeline using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed. 
+ +Use Apache Beam's language-specific SDKs to implement the data processing steps in the appropriate programming languages. For example, you could use Java to process some data, Python to process some other data, and Go to perform a specific computation. + +Use Apache Beam's cross-language support to connect the different parts of your pipeline together. Apache Beam provides a common data model and serialization format, so data can be passed seamlessly between different languages. + +By using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages, while still building a unified data processing pipeline. This can be especially useful when working with large datasets, as different languages may have different performance characteristics for different tasks. + Review Comment: Use Apache Beam's cross-language support to connect the different parts of your pipeline. Apache Beam provides a standard data model and serialization format so that data can be passed seamlessly between different languages. Using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages while building a unified data processing pipeline. This can be especially useful when working with large datasets and re-using existing components. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
