This is an automated email from the ASF dual-hosted git repository.
damccorm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/master by this push:
new 31b8a9e3dfe Duet AI Transform service SDK Harness prompts (#30499)
31b8a9e3dfe is described below
commit 31b8a9e3dfe74d0d54c4c44452de3ea2e3826025
Author: Oleh Borysevych <[email protected]>
AuthorDate: Mon Mar 11 15:11:20 2024 +0100
Duet AI Transform service SDK Harness prompts (#30499)
* transform service prompt added
* SDK harness added
* fix whitespace
* Apply suggestions from code review
Co-authored-by: Daria Bezkorovaina
<[email protected]>
---------
Co-authored-by: Daria Bezkorovaina
<[email protected]>
---
.../documentation-lookup-nolinks/48_sdk_harness.md | 17 +++++++++++
.../49_transform_service.md | 34 ++++++++++++++++++++++
.../prompts/documentation-lookup/48_sdk_harness.md | 19 ++++++++++++
.../documentation-lookup/49_transform_service.md | 34 ++++++++++++++++++++++
4 files changed, 104 insertions(+)
diff --git a/learning/prompts/documentation-lookup-nolinks/48_sdk_harness.md
b/learning/prompts/documentation-lookup-nolinks/48_sdk_harness.md
new file mode 100644
index 00000000000..3611ac78fd4
--- /dev/null
+++ b/learning/prompts/documentation-lookup-nolinks/48_sdk_harness.md
@@ -0,0 +1,17 @@
+Prompt:
+What is the SDK harness in Apache Beam?
+
+Response:
+Interoperability between Apache Beam SDKs and runners is a key aspect of
Apache Beam. The portability framework introduces well-defined,
language-neutral data structures and protocols between the SDK and runner. This
interoperability layer, known as the 'portability API', ensures that SDKs and
runners can seamlessly work with each other, reducing the interoperability
burden for both SDKs and runners to a constant effort.
+
+The portability API consists of a set of smaller contracts that isolate SDKs
and runners for job submission, management, and execution. These contracts
utilize protocols like `protobuf` and `gRPC` to provide broad language support.
Currently, all SDKs support the portability framework.
+
+The SDK harness is a program responsible for executing user code. This program
is provided by an SDK and runs separately from the runner. SDK harness
initialization relies on the provision and artifact APIs for obtaining staged
files, pipeline options, and environment information.
+
+Apache Beam offers configuration options for the SDK harness to cater to
diverse cluster setups. These options include:
+1. **`environment_type`**: determines where user code is executed. The
`environment_config` parameter configures the environment based on the value of
`environment_type`:
+ * `DOCKER`: executes user code within a container on each worker node.
Docker must be installed on worker nodes. You can specify the Docker image URL
using the `environment_config` parameter. Prebuilt SDK container images are
available with each Apache Beam release and pushed to Docker Hub. You can also
build your custom image.
+ * `PROCESS`: executes user code through processes that are automatically
initiated by the runner on each worker node.
+ * `EXTERNAL`: dispatches user code to an external service. Use the
`environment_config` parameter to specify the service address, for example,
`localhost:50000`.
+ * `LOOPBACK`: executes user code within the same process that submitted the
pipeline.
+2. **`sdk_worker_parallelism`**: determines the number of SDK workers per
worker node. The default value is 1, but setting it to 0 enables automatic
determination by the runner based on factors like the number of CPU cores on
the worker machine.
\ No newline at end of file
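
As a reading aid for the options described in 48_sdk_harness.md, here is a minimal Python sketch that validates the two settings and renders them as `--name=value` flags. The helper name `harness_flags` and its validation rules are assumptions made for illustration; they are not part of this commit or of the Beam API.

```python
# Environment types named in the prompt text above.
ENVIRONMENT_TYPES = {"DOCKER", "PROCESS", "EXTERNAL", "LOOPBACK"}


def harness_flags(environment_type, environment_config=None, sdk_worker_parallelism=1):
    """Render SDK harness options as command-line flags (hypothetical helper)."""
    env = environment_type.upper()
    if env not in ENVIRONMENT_TYPES:
        raise ValueError(f"unknown environment_type: {environment_type!r}")
    if env == "EXTERNAL" and not environment_config:
        # EXTERNAL dispatches user code to a service, so an address is needed.
        raise ValueError("EXTERNAL requires environment_config, e.g. localhost:50000")
    if sdk_worker_parallelism < 0:
        # 0 lets the runner decide; positive values fix the worker count.
        raise ValueError("sdk_worker_parallelism must be 0 or a positive integer")

    flags = [f"--environment_type={env}"]
    if environment_config:
        flags.append(f"--environment_config={environment_config}")
    flags.append(f"--sdk_worker_parallelism={sdk_worker_parallelism}")
    return flags


if __name__ == "__main__":
    # External service at the example address, runner-chosen parallelism.
    print(" ".join(harness_flags("EXTERNAL", "localhost:50000", 0)))
```

The returned list can be appended to the arguments used to launch a pipeline; whether a given runner honors each combination is outside the scope of this sketch.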
diff --git
a/learning/prompts/documentation-lookup-nolinks/49_transform_service.md
b/learning/prompts/documentation-lookup-nolinks/49_transform_service.md
new file mode 100644
index 00000000000..1be6b89b6f7
--- /dev/null
+++ b/learning/prompts/documentation-lookup-nolinks/49_transform_service.md
@@ -0,0 +1,34 @@
+Prompt:
+What is the Transform service in Apache Beam?
+
+Response:
+The Transform service is a Docker Compose service included in Apache Beam SDK
versions 2.49.0 and later. It enables you to upgrade or downgrade the Beam SDK
version of individual supported transforms in your pipeline without changing
the overall Beam version of the pipeline. Additionally, you can use the
Transform service to create and run multi-language pipelines without
installing support for additional language runtimes.
+
+You can use the Transform service to upgrade specific transforms only if
you are using Beam Java SDK 2.53.0 or later. To use this feature, run a
Java pipeline with additional pipeline options that specify the URNs of the
transforms you want to upgrade and the desired Apache Beam version:
+
+```sh
+--transformsToOverride=beam:transform:org.apache.beam:bigquery_read:v1
--transformServiceBeamVersion=2.xy.z
+```
+
+In the provided example, `--transformsToOverride` specifies the URN of the
transform to upgrade or downgrade, while `--transformServiceBeamVersion`
specifies the target Beam version.
+
+The framework automatically downloads the specified version of Docker
containers for the transforms and uses them in the pipeline. You must have
Docker installed on the machine that starts the service.
+
+Currently, the Transform service can upgrade the following transforms:
+* BigQuery read: `beam:transform:org.apache.beam:bigquery_read:v1`
+* BigQuery write: `beam:transform:org.apache.beam:bigquery_write:v1`
+* Kafka read: `beam:transform:org.apache.beam:kafka_read_with_metadata:v2`
+* Kafka write: `beam:transform:org.apache.beam:kafka_write:v2`
+
+The Transform service implements the Beam expansion API, so multi-language
pipelines can use it to expand supported transforms. This feature allows you
to create and run multi-language pipelines without additional language
runtimes. For instance, you can build a Python pipeline that uses a Java
`KafkaIO` transform without installing Java locally.
+
+In some cases, Apache Beam SDKs can start the Transform service automatically,
provided that Docker is available locally. You can also start the Transform
service manually by running the following command:
+
+```sh
+java -jar beam-sdks-java-transform-service-app-<Beam version for the jar>.jar
--port <port> --beam_version <Beam version for the transform service>
--project_name <a unique ID for the transform service> --command up
+```
+
+The Beam Transform service includes several portable transforms implemented in
the Apache Beam Java and Python SDKs:
+* Java transforms: Google Cloud I/O connectors, the Kafka I/O connector, and
the JDBC I/O connector.
+* Python transforms: all portable transforms implemented within the Apache
Beam Python SDK, such as RunInference and DataFrame transforms.
+
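
To illustrate the upgrade options described in 49_transform_service.md, the following Python sketch maps a short name to one of the four URNs listed in the prompt and returns the two pipeline options. The short names and the helper `override_flags` are hypothetical conveniences, not Beam identifiers; only the URNs and the option names come from the text above.

```python
# URNs copied from the list of upgradable transforms in the prompt text.
# The dictionary keys are hypothetical short names chosen for this sketch.
UPGRADABLE_URNS = {
    "bigquery_read": "beam:transform:org.apache.beam:bigquery_read:v1",
    "bigquery_write": "beam:transform:org.apache.beam:bigquery_write:v1",
    "kafka_read": "beam:transform:org.apache.beam:kafka_read_with_metadata:v2",
    "kafka_write": "beam:transform:org.apache.beam:kafka_write:v2",
}


def override_flags(transform, beam_version):
    """Return the two pipeline options for upgrading one supported transform."""
    if transform not in UPGRADABLE_URNS:
        supported = ", ".join(sorted(UPGRADABLE_URNS))
        raise KeyError(f"{transform!r} is not upgradable; choose one of: {supported}")
    return [
        f"--transformsToOverride={UPGRADABLE_URNS[transform]}",
        f"--transformServiceBeamVersion={beam_version}",
    ]


if __name__ == "__main__":
    # "2.xy.z" is the same version placeholder used in the prompt text.
    print(" ".join(override_flags("bigquery_read", "2.xy.z")))
```

Rejecting unknown names up front keeps a typo from silently producing an option that the service would not recognize.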
diff --git a/learning/prompts/documentation-lookup/48_sdk_harness.md
b/learning/prompts/documentation-lookup/48_sdk_harness.md
new file mode 100644
index 00000000000..084c5518200
--- /dev/null
+++ b/learning/prompts/documentation-lookup/48_sdk_harness.md
@@ -0,0 +1,19 @@
+Prompt:
+What is the SDK harness in Apache Beam?
+
+Response:
+Interoperability between Apache Beam SDKs and runners is a key aspect of
Apache Beam. The [portability
framework](https://beam.apache.org/roadmap/portability/) introduces
well-defined, language-neutral data structures and protocols between the SDK
and runner. This interoperability layer, known as the 'portability API',
ensures that SDKs and runners can seamlessly work with each other, reducing the
interoperability burden for both SDKs and runners to a constant effort.
+
+The portability API consists of a set of smaller contracts that isolate SDKs
and runners for job submission, management, and execution. These contracts
utilize protocols like `protobuf` and `gRPC` to provide broad language support.
+
+Currently, all SDKs support the portability framework. For the latest
information on portability support across SDKs, features, and runners, refer to
the [Apache Beam Portability Support
Matrix](https://docs.google.com/spreadsheets/d/1KDa_FGn1ShjomGd-UUDOhuh2q73de2tPz6BqHpzqvNI/edit#gid=0).
+
+The SDK harness is a program responsible for executing user code. This program
is provided by an SDK and runs separately from the runner. SDK harness
initialization relies on the provision and artifact APIs for obtaining staged
files, pipeline options, and environment information.
+
+Apache Beam offers configuration options for the SDK harness to cater to
diverse cluster setups. These options include:
+1. **`environment_type`**: determines where user code is executed. The
`environment_config` parameter configures the environment based on the value of
`environment_type`:
+ * `DOCKER`: executes user code within a container on each worker node.
Docker must be installed on worker nodes. You can specify the Docker image URL
using the `environment_config` parameter. Prebuilt SDK container images are
available with each Apache Beam release and pushed to [Docker
Hub](https://hub.docker.com/search?q=apache%2Fbeam&type=image). You can also
[build your custom
image](https://beam.apache.org/documentation/runtime/environments/).
+ * `PROCESS`: executes user code through processes that are automatically
initiated by the runner on each worker node.
+ * `EXTERNAL`: dispatches user code to an external service. Use the
`environment_config` parameter to specify the service address, for example,
`localhost:50000`.
+ * `LOOPBACK`: executes user code within the same process that submitted the
pipeline.
+2. **`sdk_worker_parallelism`**: determines the number of SDK workers per
worker node. The default value is 1, but setting it to 0 enables automatic
determination by the runner based on factors like the number of CPU cores on
the worker machine.
\ No newline at end of file
diff --git a/learning/prompts/documentation-lookup/49_transform_service.md
b/learning/prompts/documentation-lookup/49_transform_service.md
new file mode 100644
index 00000000000..6f803b9b657
--- /dev/null
+++ b/learning/prompts/documentation-lookup/49_transform_service.md
@@ -0,0 +1,34 @@
+Prompt:
+What is the Transform service in Apache Beam?
+
+Response:
+The Transform service is a [Docker Compose](https://docs.docker.com/compose/)
service included in Apache Beam SDK versions 2.49.0 and later. It enables you
to upgrade or downgrade the Beam SDK version of individual supported transforms
in your pipeline without changing the overall Beam version of the pipeline.
Additionally, you can use the Transform service to create and run
multi-language pipelines without installing support for additional
language runtimes.
+
+You can use the Transform service to upgrade specific transforms only if
you are using Beam Java SDK 2.53.0 or later. To use this feature, run a
Java pipeline with additional pipeline options that specify the URNs of the
transforms you want to upgrade and the desired Apache Beam version:
+
+```sh
+--transformsToOverride=beam:transform:org.apache.beam:bigquery_read:v1
--transformServiceBeamVersion=2.xy.z
+```
+
+In the provided example, `--transformsToOverride` specifies the URN of the
transform to upgrade or downgrade, while `--transformServiceBeamVersion`
specifies the target Beam version.
+
+The framework automatically downloads the specified version of Docker
containers for the transforms and uses them in the pipeline. You must have
Docker installed on the machine that starts the service.
+
+Currently, the Transform service can upgrade the following transforms:
+* BigQuery read: `beam:transform:org.apache.beam:bigquery_read:v1`
+* BigQuery write: `beam:transform:org.apache.beam:bigquery_write:v1`
+* Kafka read: `beam:transform:org.apache.beam:kafka_read_with_metadata:v2`
+* Kafka write: `beam:transform:org.apache.beam:kafka_write:v2`
+
+The Transform service implements the Beam expansion API, so multi-language
pipelines can use it to expand supported transforms. This feature allows you
to create and run multi-language pipelines without additional language
runtimes. For instance, you can build a Python pipeline that uses a Java
`KafkaIO` transform without installing Java locally.
+
+In some cases, Apache Beam SDKs can start the Transform service automatically,
provided that Docker is available locally. You can also start the Transform
service manually by running the following command:
+
+```sh
+java -jar beam-sdks-java-transform-service-app-<Beam version for the jar>.jar
--port <port> --beam_version <Beam version for the transform service>
--project_name <a unique ID for the transform service> --command up
+```
+
+The Beam Transform service includes several portable transforms implemented in
the Apache Beam Java and Python SDKs:
+* Java transforms: Google Cloud I/O connectors, the Kafka I/O connector, and
the JDBC I/O connector.
+* Python transforms: all portable transforms implemented within the Apache
Beam Python SDK, such as RunInference and DataFrame transforms.
+
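
For the manual start command shown in 49_transform_service.md, this Python sketch assembles the same argument list from its four placeholders. The helper `transform_service_command` and the sample values are assumptions for illustration only; the jar name pattern and the flags are taken from the command in the prompt.

```python
import shlex


def transform_service_command(jar_version, port, beam_version, project_name):
    """Build the argument list for starting the Transform service manually."""
    return [
        "java",
        "-jar",
        f"beam-sdks-java-transform-service-app-{jar_version}.jar",
        "--port", str(port),
        "--beam_version", beam_version,
        "--project_name", project_name,
        "--command", "up",
    ]


if __name__ == "__main__":
    # Placeholder values: substitute a real Beam version, port, and unique ID.
    argv = transform_service_command("2.xy.z", 12345, "2.xy.z", "my-transform-service")
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Keeping the command as a list avoids quoting mistakes if it is later passed to `subprocess.run`; as the prompt notes, Docker must be installed on the machine that starts the service.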