(beam) branch master updated: Duet AI Transform service SDK Harness prompts (#30499)

damccorm Mon, 11 Mar 2024 07:11:33 -0700

This is an automated email from the ASF dual-hosted git repository.

damccorm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git



The following commit(s) were added to refs/heads/master by this push:
     new 31b8a9e3dfe Duet AI Transform service SDK Harness prompts (#30499)
31b8a9e3dfe is described below

commit 31b8a9e3dfe74d0d54c4c44452de3ea2e3826025
Author: Oleh Borysevych <[email protected]>
AuthorDate: Mon Mar 11 15:11:20 2024 +0100

    Duet AI Transform service SDK Harness prompts (#30499)
    
    * transform service prompt added
    
    * SDK harness added
    
    * fix whitespace
    
    * Apply suggestions from code review
    
    Co-authored-by: Daria Bezkorovaina <[email protected]>
    
    ---------
    
    Co-authored-by: Daria Bezkorovaina <[email protected]>
---
 .../documentation-lookup-nolinks/48_sdk_harness.md | 17 +++++++++++
 .../49_transform_service.md                        | 34 ++++++++++++++++++++++
 .../prompts/documentation-lookup/48_sdk_harness.md | 19 ++++++++++++
 .../documentation-lookup/49_transform_service.md   | 34 ++++++++++++++++++++++
 4 files changed, 104 insertions(+)

diff --git a/learning/prompts/documentation-lookup-nolinks/48_sdk_harness.md b/learning/prompts/documentation-lookup-nolinks/48_sdk_harness.md
new file mode 100644
index 00000000000..3611ac78fd4
--- /dev/null
+++ b/learning/prompts/documentation-lookup-nolinks/48_sdk_harness.md
@@ -0,0 +1,17 @@
+Prompt:
+What is SDK harness in Apache Beam?
+
+Response:
+Interoperability between Apache Beam SDKs and runners is a key aspect of 
Apache Beam. The portability framework introduces well-defined, 
language-neutral data structures and protocols between the SDK and runner. This 
interoperability layer, known as the 'portability API', ensures that SDKs and 
runners can seamlessly work with each other, reducing the interoperability 
burden for both SDKs and runners to a constant effort.
+
+The portability API consists of a set of smaller contracts that isolate SDKs 
and runners for job submission, management, and execution. These contracts 
utilize protocols like `protobuf` and `gRPC` to provide broad language support. 
Currently, all SDKs support the portability framework.
+
+The SDK harness is a program responsible for executing user code. This program 
is provided by an SDK and runs separately from the runner. SDK harness 
initialization relies on the provision and artifact APIs for obtaining staged 
files, pipeline options, and environment information.
+
+Apache Beam offers configuration options for the SDK harness to cater to 
diverse cluster setups. These options include:
+1. **`environment_type`**: determines where user code is executed. The 
`environment_config` parameter configures the environment based on the value of 
`environment_type`:
+  * `DOCKER`: executes user code within a container on each worker node. 
Docker must be installed on worker nodes. You can specify the Docker image URL 
using the `environment_config` parameter. Prebuilt SDK container images are 
available with each Apache Beam release and pushed to Docker Hub. You can also 
build your custom image.
+  * `PROCESS`: executes user code through processes that are automatically 
initiated by the runner on each worker node.
+  * `EXTERNAL`: dispatches user code to an external service. Use the 
`environment_config` parameter to specify the service address, for example, 
`localhost:50000`.
+  * `LOOPBACK`: executes user code within the same process that submitted the 
pipeline.
+2. **`sdk_worker_parallelism`**: determines the number of SDK workers per 
worker node. The default value is 1, but setting it to 0 enables automatic 
determination by the runner based on factors like the number of CPU cores on 
the worker machine.
\ No newline at end of file
diff --git a/learning/prompts/documentation-lookup-nolinks/49_transform_service.md b/learning/prompts/documentation-lookup-nolinks/49_transform_service.md
new file mode 100644
index 00000000000..1be6b89b6f7
--- /dev/null
+++ b/learning/prompts/documentation-lookup-nolinks/49_transform_service.md
@@ -0,0 +1,34 @@
+Prompt:
+What is Transform service in Apache Beam?
+
+Response:
+The Transform service is a Docker Compose service included in Apache Beam SDK 
versions 2.49.0 and later. It enables you to upgrade or downgrade the Beam SDK 
version of individual supported transforms in your pipeline without changing 
the overall Beam version of the pipeline. Additionally, you can utilize the 
Transform service to create and execute multi-language pipelines without 
needing to install support for additional language runtimes.
+
+You can utilize the Transform service to upgrade specific transforms only if 
you are using Beam Java SDK 2.53.0 and later. To employ this feature, execute a 
Java pipeline with additional pipeline options specifying the URNs of the 
transforms you want to upgrade and the desired Apache Beam version:
+
+```java
+--transformsToOverride=beam:transform:org.apache.beam:bigquery_read:v1 --transformServiceBeamVersion=2.xy.z
+```
+
+In the provided example, `--transformsToOverride` specifies the URN of the 
transform to upgrade or downgrade, while `--transformServiceBeamVersion` 
specifies the target Beam version.
+
+The framework automatically downloads the specified version of Docker 
containers for the transforms and uses them in the pipeline. You must have 
Docker installed on the machine that starts the service.
+
+Currently, the Transform service can upgrade the following transforms:
+* BigQuery read: `beam:transform:org.apache.beam:bigquery_read:v1`
+* BigQuery write: `beam:transform:org.apache.beam:bigquery_write:v1`
+* Kafka read: `beam:transform:org.apache.beam:kafka_read_with_metadata:v2`
+* Kafka write: `beam:transform:org.apache.beam:kafka_write:v2`
+
+The Transform service implements the Beam expansion API, enabling 
multi-language pipelines to leverage it for expanding supported transforms. 
This feature allows you to create and run multi-language pipelines without 
additional language runtimes. For instance, you can build a Python pipeline 
that utilizes a Java `KafkaIO` transform without the need to install Java 
locally.
+
+In some cases, Apache Beam SDKs can start the Transform service automatically, 
provided that Docker is available locally. You can also start the Transform 
service manually by running the following command:
+
+```java
+java -jar beam-sdks-java-transform-service-app-<Beam version for the jar>.jar --port <port> --beam_version <Beam version for the transform service> --project_name <a unique ID for the transform service> --command up
+```
+
+The Beam Transform service includes several portable transforms implemented in 
the Apache Beam Java and Python SDKs:
+* Java transforms: Google Cloud I/O connectors, the Kafka I/O connector, and 
the JDBC I/O connector.
+* Python transforms: all portable transforms implemented within the Apache 
Beam Python SDK, such as RunInference and DataFrame transforms.
+
diff --git a/learning/prompts/documentation-lookup/48_sdk_harness.md b/learning/prompts/documentation-lookup/48_sdk_harness.md
new file mode 100644
index 00000000000..084c5518200
--- /dev/null
+++ b/learning/prompts/documentation-lookup/48_sdk_harness.md
@@ -0,0 +1,19 @@
+Prompt:
+What is SDK harness in Apache Beam?
+
+Response:
+Interoperability between Apache Beam SDKs and runners is a key aspect of 
Apache Beam. The [portability 
framework](https://beam.apache.org/roadmap/portability/) introduces 
well-defined, language-neutral data structures and protocols between the SDK 
and runner. This interoperability layer, known as the 'portability API', 
ensures that SDKs and runners can seamlessly work with each other, reducing the 
interoperability burden for both SDKs and runners to a constant effort.
+
+The portability API consists of a set of smaller contracts that isolate SDKs 
and runners for job submission, management, and execution. These contracts 
utilize protocols like `protobuf` and `gRPC` to provide broad language support.
+
+Currently, all SDKs support the portability framework. For the latest 
information on portability support across SDKs, features, and runners, refer to 
the [Apache Beam Portability Support 
Matrix](https://docs.google.com/spreadsheets/d/1KDa_FGn1ShjomGd-UUDOhuh2q73de2tPz6BqHpzqvNI/edit#gid=0).
+
+The SDK harness is a program responsible for executing user code. This program 
is provided by an SDK and runs separately from the runner. SDK harness 
initialization relies on the provision and artifact APIs for obtaining staged 
files, pipeline options, and environment information.
+
+Apache Beam offers configuration options for the SDK harness to cater to 
diverse cluster setups. These options include:
+1. **`environment_type`**: determines where user code is executed. The 
`environment_config` parameter configures the environment based on the value of 
`environment_type`:
+  * `DOCKER`: executes user code within a container on each worker node. 
Docker must be installed on worker nodes. You can specify the Docker image URL 
using the `environment_config` parameter. Prebuilt SDK container images are 
available with each Apache Beam release and pushed to [Docker 
Hub](https://hub.docker.com/search?q=apache%2Fbeam&type=image). You can also 
[build your custom 
image](https://beam.apache.org/documentation/runtime/environments/).
+  * `PROCESS`: executes user code through processes that are automatically 
initiated by the runner on each worker node.
+  * `EXTERNAL`: dispatches user code to an external service. Use the 
`environment_config` parameter to specify the service address, for example, 
`localhost:50000`.
+  * `LOOPBACK`: executes user code within the same process that submitted the 
pipeline.
+2. **`sdk_worker_parallelism`**: determines the number of SDK workers per 
worker node. The default value is 1, but setting it to 0 enables automatic 
determination by the runner based on factors like the number of CPU cores on 
the worker machine.
\ No newline at end of file
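The SDK harness options described in 48_sdk_harness.md above can be sketched as a small helper that assembles pipeline flags. This is only an illustration, not part of the commit: the helper name `sdk_harness_flags` is hypothetical, and the `--name=value` flag spelling is an assumption based on the option names given in the prompt.

```python
# Environment types listed in the prompt text.
VALID_ENVIRONMENT_TYPES = {"DOCKER", "PROCESS", "EXTERNAL", "LOOPBACK"}


def sdk_harness_flags(environment_type, environment_config=None,
                      sdk_worker_parallelism=1):
    """Return a list of pipeline flags for the SDK harness (hypothetical helper).

    environment_config is passed through untouched; its meaning depends on
    environment_type (for example, a Docker image URL for DOCKER or a
    service address for EXTERNAL).
    """
    if environment_type not in VALID_ENVIRONMENT_TYPES:
        raise ValueError(f"unknown environment_type: {environment_type!r}")
    if sdk_worker_parallelism < 0:
        raise ValueError("sdk_worker_parallelism must be 0 or greater")

    flags = [f"--environment_type={environment_type}"]
    if environment_config is not None:
        flags.append(f"--environment_config={environment_config}")
    # 1 is the default per the prompt; 0 lets the runner decide.
    flags.append(f"--sdk_worker_parallelism={sdk_worker_parallelism}")
    return flags


# Example: dispatch user code to an external service on localhost:50000.
external_flags = sdk_harness_flags("EXTERNAL", "localhost:50000")
print(" ".join(external_flags))
```

The resulting list could be passed as command-line arguments to a pipeline; whether a given runner honors every combination is not covered by the prompt text.
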
diff --git a/learning/prompts/documentation-lookup/49_transform_service.md b/learning/prompts/documentation-lookup/49_transform_service.md
new file mode 100644
index 00000000000..6f803b9b657
--- /dev/null
+++ b/learning/prompts/documentation-lookup/49_transform_service.md
@@ -0,0 +1,34 @@
+Prompt:
+What is Transform service in Apache Beam?
+
+Response:
+The Transform service is a [Docker Compose](https://docs.docker.com/compose/) 
service included in Apache Beam SDK versions 2.49.0 and later. It enables you 
to upgrade or downgrade the Beam SDK version of individual supported transforms 
in your pipeline without changing the overall Beam version of the pipeline. 
Additionally, you can utilize the Transform service to create and execute 
multi-language pipelines without needing to install support for additional 
language runtimes.
+
+You can utilize the Transform service to upgrade specific transforms only if 
you are using Beam Java SDK 2.53.0 and later. To employ this feature, execute a 
Java pipeline with additional pipeline options specifying the URNs of the 
transforms you want to upgrade and the desired Apache Beam version:
+
+```java
+--transformsToOverride=beam:transform:org.apache.beam:bigquery_read:v1 --transformServiceBeamVersion=2.xy.z
+```
+
+In the provided example, `--transformsToOverride` specifies the URN of the 
transform to upgrade or downgrade, while `--transformServiceBeamVersion` 
specifies the target Beam version.
+
+The framework automatically downloads the specified version of Docker 
containers for the transforms and uses them in the pipeline. You must have 
Docker installed on the machine that starts the service.
+
+Currently, the Transform service can upgrade the following transforms:
+* BigQuery read: `beam:transform:org.apache.beam:bigquery_read:v1`
+* BigQuery write: `beam:transform:org.apache.beam:bigquery_write:v1`
+* Kafka read: `beam:transform:org.apache.beam:kafka_read_with_metadata:v2`
+* Kafka write: `beam:transform:org.apache.beam:kafka_write:v2`
+
+The Transform service implements the Beam expansion API, enabling 
multi-language pipelines to leverage it for expanding supported transforms. 
This feature allows you to create and run multi-language pipelines without 
additional language runtimes. For instance, you can build a Python pipeline 
that utilizes a Java `KafkaIO` transform without the need to install Java 
locally.
+
+In some cases, Apache Beam SDKs can start the Transform service automatically, 
provided that Docker is available locally. You can also start the Transform 
service manually by running the following command:
+
+```java
+java -jar beam-sdks-java-transform-service-app-<Beam version for the jar>.jar --port <port> --beam_version <Beam version for the transform service> --project_name <a unique ID for the transform service> --command up
+```
+
+The Beam Transform service includes several portable transforms implemented in 
the Apache Beam Java and Python SDKs:
+* Java transforms: Google Cloud I/O connectors, the Kafka I/O connector, and 
the JDBC I/O connector.
+* Python transforms: all portable transforms implemented within the Apache 
Beam Python SDK, such as RunInference and DataFrame transforms.
+
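The manual start command quoted in 49_transform_service.md above can be assembled programmatically. The following is a minimal sketch, not part of the commit: the function name `transform_service_command` is hypothetical, and the version, port, and project name in the example are placeholder values chosen for illustration only.

```python
import shlex


def transform_service_command(jar_version, port, beam_version,
                              project_name, command="up"):
    """Build the argument list for starting the Transform service manually.

    Mirrors the `java -jar ...` command from the prompt; every value is
    supplied by the caller.
    """
    jar = f"beam-sdks-java-transform-service-app-{jar_version}.jar"
    return [
        "java", "-jar", jar,
        "--port", str(port),
        "--beam_version", beam_version,
        "--project_name", project_name,
        "--command", command,
    ]


# Placeholder values, for illustration only.
argv = transform_service_command("2.53.0", 12345, "2.53.0", "my-service")
print(shlex.join(argv))
```

Keeping the command as a list rather than a single string means it can be handed to `subprocess.run` without shell quoting concerns. As the prompt notes, Docker must be installed on the machine that starts the service.
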
