rszper commented on code in PR #29507: URL: https://github.com/apache/beam/pull/29507#discussion_r1411430656
########## learning/prompts/documentation-lookup/13_advanced_splittable_dofn.md: ########## @@ -0,0 +1,18 @@ +Prompt: +What is Splittable DoFn in Apache Beam? +Response: +Splittable DoFn (SDF) is a generalization of [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that lets you process elements in a non-monolithic way. Splittable DoFn makes creating complex, modular I/O connectors in Beam easier. +When you apply a splittable DoFn to an element, the runner can split the element’s processing into smaller tasks. You can checkpoint the processing of an element, and you can split the remaining work to yield additional parallelism. + +At a high level, an SDF is responsible for processing element and restriction pairs. A restriction represents a subset of work that would have been necessary to have been done when processing the element. + +Executing an [Splittable DoFn](https://beam.apache.org/documentation/programming-guide/#splittable-dofns) follows the following steps: +1. Each element is paired with a restriction (e.g. filename is paired with offset range representing the whole file). Review Comment: ```suggestion 1. Each element is paired with a restriction. For example, a filename is paired with an offset range representing the whole file. ``` ########## learning/prompts/documentation-lookup/13_advanced_splittable_dofn.md: ########## @@ -0,0 +1,18 @@ +Prompt: +What is Splittable DoFn in Apache Beam? +Response: +Splittable DoFn (SDF) is a generalization of [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that lets you process elements in a non-monolithic way. Splittable DoFn makes creating complex, modular I/O connectors in Beam easier. +When you apply a splittable DoFn to an element, the runner can split the element’s processing into smaller tasks. You can checkpoint the processing of an element, and you can split the remaining work to yield additional parallelism. Review Comment: ```suggestion When you apply a splittable `DoFn` to an element, the runner can split the element’s processing into smaller tasks. You can checkpoint the processing of an element, and you can split the remaining work to yield additional parallelism. ``` ########## learning/prompts/documentation-lookup/13_advanced_splittable_dofn.md: ########## @@ -0,0 +1,18 @@ +Prompt: +What is Splittable DoFn in Apache Beam? +Response: +Splittable DoFn (SDF) is a generalization of [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that lets you process elements in a non-monolithic way. Splittable DoFn makes creating complex, modular I/O connectors in Beam easier. +When you apply a splittable DoFn to an element, the runner can split the element’s processing into smaller tasks. You can checkpoint the processing of an element, and you can split the remaining work to yield additional parallelism. + +At a high level, an SDF is responsible for processing element and restriction pairs. A restriction represents a subset of work that would have been necessary to have been done when processing the element. + +Executing an [Splittable DoFn](https://beam.apache.org/documentation/programming-guide/#splittable-dofns) follows the following steps: +1. Each element is paired with a restriction (e.g. filename is paired with offset range representing the whole file). +2. Each element and restriction pair is split (e.g. offset ranges are broken up into smaller pieces). +3. The runner redistributes the element and restriction pairs to several workers. +4. 
Element and restriction pairs are processed in parallel (e.g. the file is read). Within this last step, the element and restriction pair can pause its own processing and/or be split into further element and restriction pairs. + +See Tour of Beam [Splittable DoFn module](https://tour.beam.apache.org/tour/python/splittable-dofn/splittable) for a practical example. + +See [community blogpost](https://beam.apache.org/blog/splittable-do-fn-is-available/) for more information. Review Comment: ```suggestion For more information, see the [community blog post](https://beam.apache.org/blog/splittable-do-fn-is-available/). ``` ########## learning/prompts/documentation-lookup/15_advanced_xlang.md: ########## @@ -0,0 +1,15 @@ +Prompt: +What is a multi-language pipeline in Apache Beam? +Response: +Beam lets you combine transforms written in any supported SDK language (currently, [Java](https://beam.apache.org/documentation/programming-guide/#1311-creating-cross-language-java-transforms) and [Python](https://beam.apache.org/documentation/programming-guide/#1312-creating-cross-language-python-transforms)) and use them in one multi-language pipeline. For example, a pipeline that reads from a Python source processes the data using a Java transform, and writes the data to a Python sink is a multi-language pipeline. Review Comment: ```suggestion Beam lets you combine transforms written in any supported SDK language (currently, [Java](https://beam.apache.org/documentation/programming-guide/#1311-creating-cross-language-java-transforms) and [Python](https://beam.apache.org/documentation/programming-guide/#1312-creating-cross-language-python-transforms)) and use them in one multi-language pipeline. For example, a multi-language pipeline might read from a Python source, process the data using a Java transform, and write the data to a Python sink. ``` ########## learning/prompts/documentation-lookup/16_advanced_pipeline_lifecycle.md: ########## @@ -0,0 +1,36 @@ +Prompt: +What is a pipeline development lifecycle in Apache Beam? +Response: + +The Apache Beam pipeline development lifecycle is an iterative process that usually involves the following steps: + +- Design your pipeline. +- Develop your pipeline code. +- Test your pipeline. +- Deploy your pipeline. + +On each iteration, you may need to go back and forth between the different steps to refine your pipeline code and fix any bugs you find. + +Designing a pipeline addresses the following questions: +- Where is my data stored? +- What does your data look like? Review Comment: ```suggestion - What does the data look like? ``` ########## learning/prompts/documentation-lookup/16_advanced_pipeline_lifecycle.md: ########## @@ -0,0 +1,36 @@ +Prompt: +What is a pipeline development lifecycle in Apache Beam? +Response: + +The Apache Beam pipeline development lifecycle is an iterative process that usually involves the following steps: + +- Design your pipeline. +- Develop your pipeline code. +- Test your pipeline. +- Deploy your pipeline. + +On each iteration, you may need to go back and forth between the different steps to refine your pipeline code and fix any bugs you find. + +Designing a pipeline addresses the following questions: +- Where is my data stored? +- What does your data look like? +- What do you want to do with your data? +- What does your output data look like, and where should it go? 
+ +Beam documentation has more information on [pipeline design](https://beam.apache.org/documentation/pipelines/design-your-pipeline/) and common [pipeline patterns](https://beam.apache.org/documentation/patterns/overview/). Review Comment: ```suggestion Beam documentation has more information about [pipeline design](https://beam.apache.org/documentation/pipelines/design-your-pipeline/) and about common [pipeline patterns](https://beam.apache.org/documentation/patterns/overview/). ``` ########## learning/prompts/documentation-lookup/16_advanced_pipeline_lifecycle.md: ########## @@ -0,0 +1,36 @@ +Prompt: +What is a pipeline development lifecycle in Apache Beam? +Response: + +The Apache Beam pipeline development lifecycle is an iterative process that usually involves the following steps: + +- Design your pipeline. +- Develop your pipeline code. +- Test your pipeline. +- Deploy your pipeline. + +On each iteration, you may need to go back and forth between the different steps to refine your pipeline code and fix any bugs you find. + +Designing a pipeline addresses the following questions: +- Where is my data stored? +- What does your data look like? +- What do you want to do with your data? +- What does your output data look like, and where should it go? + +Beam documentation has more information on [pipeline design](https://beam.apache.org/documentation/pipelines/design-your-pipeline/) and common [pipeline patterns](https://beam.apache.org/documentation/patterns/overview/). + + +Apache Beam program expresses a data processing pipeline, from start to finish. To construct a pipeline using the classes in the Beam SDKs, your program will need to perform the following general steps: + +- Create a Pipeline object +- Use a Read or Create transform to create one or more PCollections for your pipeline data +- Apply transforms to each PCollection +- Write or otherwise output the final, transformed PCollections +- Run the pipeline + +Beam documentation has more on [developing](https://beam.apache.org/documentation/programming-guide/) and [executing](https://beam.apache.org/documentation/pipelines/create-your-pipeline/) pipelines. Review Comment: ```suggestion The Apache Beam documentation has more information about [developing](https://beam.apache.org/documentation/programming-guide/) and [executing](https://beam.apache.org/documentation/pipelines/create-your-pipeline/) pipelines. ``` ########## learning/prompts/documentation-lookup/16_advanced_pipeline_lifecycle.md: ########## @@ -0,0 +1,36 @@ +Prompt: +What is a pipeline development lifecycle in Apache Beam? +Response: + +The Apache Beam pipeline development lifecycle is an iterative process that usually involves the following steps: + +- Design your pipeline. +- Develop your pipeline code. +- Test your pipeline. +- Deploy your pipeline. + +On each iteration, you may need to go back and forth between the different steps to refine your pipeline code and fix any bugs you find. + +Designing a pipeline addresses the following questions: +- Where is my data stored? +- What does your data look like? +- What do you want to do with your data? +- What does your output data look like, and where should it go? + +Beam documentation has more information on [pipeline design](https://beam.apache.org/documentation/pipelines/design-your-pipeline/) and common [pipeline patterns](https://beam.apache.org/documentation/patterns/overview/). + + +Apache Beam program expresses a data processing pipeline, from start to finish. 
To construct a pipeline using the classes in the Beam SDKs, your program will need to perform the following general steps: + +- Create a Pipeline object +- Use a Read or Create transform to create one or more PCollections for your pipeline data +- Apply transforms to each PCollection Review Comment: ```suggestion - Apply transforms to each `PCollection`. ``` ########## learning/prompts/documentation-lookup/17_advanced_ai_ml.md: ########## @@ -0,0 +1,18 @@ +Prompt: +What are AI and ML capabilities in Apache Beam? +Response: +Apache Beam has several built-in [AI and ML capabilities](https://beam.apache.org/documentation/ml/overview/) that enable you to: +- Process large datasets for both preprocessing and model inference. +- Conduct exploratory data analysis and smoothly scale up data pipelines in production as part of your MLOps ecosystem. +- Run your models in production with varying data loads, both in batch and streaming + +See [here](https://beam.apache.org/documentation/patterns/ai-platform/) for common AI Platform integration patterns in Apache Beam. + +The recommended way to implement inference in Apache Beam is by using the [RunInference API](https://beam.apache.org/documentation/sdks/python-machine-learning/). See [here](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_pytorch_tensorflow_sklearn.ipynb) for more details on how to use RunInference for PyTorch, scikit-learn, and TensorFlow. Review Comment: ```suggestion The recommended way to implement inference in Apache Beam is by using the [RunInference API](https://beam.apache.org/documentation/sdks/python-machine-learning/). For more information about how to use RunInference for PyTorch, scikit-learn, and TensorFlow, see the [Use RunInference in Apache Beam](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_pytorch_tensorflow_sklearn.ipynb) example in GitHub. ``` ########## learning/prompts/documentation-lookup/19_io_pubsub.md: ########## @@ -0,0 +1,25 @@ +Prompt: +Is PubSub supported in Apache Beam? +Response: +[PubSub](https://cloud.google.com/pubsub) is a[ Google Cloud](https://cloud.google.com/) service that provides a simple, reliable, scalable, and secure real-time messaging service for sending and receiving messages between independent applications. Apache Beam provides a PubSubIO connector that allows you to read and write messages from and to PubSub. Review Comment: ```suggestion Yes, Apache Beam integrates with Pub/Sub. [Pub/Sub](https://cloud.google.com/pubsub) is a [Google Cloud](https://cloud.google.com/) service that provides a simple, reliable, scalable, and secure real-time messaging service. Use Pub/Sub to send and receive messages between independent applications. Apache Beam provides a `PubSubIO` connector that lets you read and write messages to and from Pub/Sub. ``` ########## learning/prompts/documentation-lookup/14_advanced_pipeline_patterns.md: ########## @@ -0,0 +1,13 @@ +Prompt: +What pipeline patterns exist in Apache Beam? +Response: +Beam pipeline patterns are a set of best practices for building Beam pipelines. They are based on real-world Beam deployments and are designed to help you build resilient, flexible, and portable Beam pipelines. 
+ +Here are some of the most common pipeline patterns: Review Comment: ```suggestion The following list includes some of the most common pipeline patterns: ``` ########## learning/prompts/documentation-lookup/13_advanced_splittable_dofn.md: ########## @@ -0,0 +1,18 @@ +Prompt: +What is Splittable DoFn in Apache Beam? +Response: +Splittable DoFn (SDF) is a generalization of [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that lets you process elements in a non-monolithic way. Splittable DoFn makes creating complex, modular I/O connectors in Beam easier. +When you apply a splittable DoFn to an element, the runner can split the element’s processing into smaller tasks. You can checkpoint the processing of an element, and you can split the remaining work to yield additional parallelism. + +At a high level, an SDF is responsible for processing element and restriction pairs. A restriction represents a subset of work that would have been necessary to have been done when processing the element. + +Executing an [Splittable DoFn](https://beam.apache.org/documentation/programming-guide/#splittable-dofns) follows the following steps: +1. Each element is paired with a restriction (e.g. filename is paired with offset range representing the whole file). +2. Each element and restriction pair is split (e.g. offset ranges are broken up into smaller pieces). Review Comment: ```suggestion 2. Each element and restriction pair is split. For example, offset ranges are broken up into smaller pieces. ``` ########## learning/prompts/documentation-lookup/13_advanced_splittable_dofn.md: ########## @@ -0,0 +1,18 @@ +Prompt: +What is Splittable DoFn in Apache Beam? +Response: +Splittable DoFn (SDF) is a generalization of [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that lets you process elements in a non-monolithic way. Splittable DoFn makes creating complex, modular I/O connectors in Beam easier. +When you apply a splittable DoFn to an element, the runner can split the element’s processing into smaller tasks. You can checkpoint the processing of an element, and you can split the remaining work to yield additional parallelism. + +At a high level, an SDF is responsible for processing element and restriction pairs. A restriction represents a subset of work that would have been necessary to have been done when processing the element. + +Executing an [Splittable DoFn](https://beam.apache.org/documentation/programming-guide/#splittable-dofns) follows the following steps: +1. Each element is paired with a restriction (e.g. filename is paired with offset range representing the whole file). +2. Each element and restriction pair is split (e.g. offset ranges are broken up into smaller pieces). +3. The runner redistributes the element and restriction pairs to several workers. +4. Element and restriction pairs are processed in parallel (e.g. the file is read). Within this last step, the element and restriction pair can pause its own processing and/or be split into further element and restriction pairs. Review Comment: ```suggestion 4. Element and restriction pairs are processed in parallel. For example, the file is read. Within this last step, the element and restriction pair can pause its own processing or be split into further element and restriction pairs. ``` ########## learning/prompts/documentation-lookup/13_advanced_splittable_dofn.md: ########## @@ -0,0 +1,18 @@ +Prompt: +What is Splittable DoFn in Apache Beam? 
+Response: +Splittable DoFn (SDF) is a generalization of [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that lets you process elements in a non-monolithic way. Splittable DoFn makes creating complex, modular I/O connectors in Beam easier. Review Comment: ```suggestion A splittable `DoFn` (SDF) is a generalization of [`DoFn`](https://beam.apache.org/documentation/programming-guide/#pardo) that lets you process elements in a non-monolithic way. A splittable `DoFn` makes it easier to create complex, modular I/O connectors in Beam. ``` ########## learning/prompts/documentation-lookup/13_advanced_splittable_dofn.md: ########## @@ -0,0 +1,18 @@ +Prompt: +What is Splittable DoFn in Apache Beam? +Response: +Splittable DoFn (SDF) is a generalization of [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that lets you process elements in a non-monolithic way. Splittable DoFn makes creating complex, modular I/O connectors in Beam easier. +When you apply a splittable DoFn to an element, the runner can split the element’s processing into smaller tasks. You can checkpoint the processing of an element, and you can split the remaining work to yield additional parallelism. + +At a high level, an SDF is responsible for processing element and restriction pairs. A restriction represents a subset of work that would have been necessary to have been done when processing the element. + +Executing an [Splittable DoFn](https://beam.apache.org/documentation/programming-guide/#splittable-dofns) follows the following steps: +1. Each element is paired with a restriction (e.g. filename is paired with offset range representing the whole file). +2. Each element and restriction pair is split (e.g. offset ranges are broken up into smaller pieces). +3. The runner redistributes the element and restriction pairs to several workers. +4. Element and restriction pairs are processed in parallel (e.g. the file is read). Within this last step, the element and restriction pair can pause its own processing and/or be split into further element and restriction pairs. + +See Tour of Beam [Splittable DoFn module](https://tour.beam.apache.org/tour/python/splittable-dofn/splittable) for a practical example. Review Comment: ```suggestion For an example, see the [Splittable DoFn module](https://tour.beam.apache.org/tour/python/splittable-dofn/splittable) in the Tour of Beam. ``` ########## learning/prompts/documentation-lookup/16_advanced_pipeline_lifecycle.md: ########## @@ -0,0 +1,36 @@ +Prompt: +What is a pipeline development lifecycle in Apache Beam? +Response: + +The Apache Beam pipeline development lifecycle is an iterative process that usually involves the following steps: + +- Design your pipeline. +- Develop your pipeline code. +- Test your pipeline. +- Deploy your pipeline. + +On each iteration, you may need to go back and forth between the different steps to refine your pipeline code and fix any bugs you find. Review Comment: ```suggestion During each iteration, you might need to go back and forth between the different steps to refine your pipeline code and to fix bugs. ``` ########## learning/prompts/documentation-lookup/20_io_biguery.md: ########## @@ -0,0 +1,39 @@ +Prompt: +Is BigQuery supported in Apache Beam? 
+Response: +[BigQuery](https://cloud.google.com/bigquery) is a[ Google Cloud](https://cloud.google.com/) serverless and cost-effective enterprise data warehouse. Apache Beam provides a BigQueryIO connector to read and write data from and to BigQuery. BigQueryIO supports both batch and streaming pipelines. + +BigQueryIO is supported in the following Beam SDKs: Review Comment: ```suggestion The following Apache Beam SDKs support the `BigQueryIO` connector: ``` ########## learning/prompts/documentation-lookup/15_advanced_xlang.md: ########## @@ -0,0 +1,15 @@ +Prompt: +What is a multi-language pipeline in Apache Beam? +Response: +Beam lets you combine transforms written in any supported SDK language (currently, [Java](https://beam.apache.org/documentation/programming-guide/#1311-creating-cross-language-java-transforms) and [Python](https://beam.apache.org/documentation/programming-guide/#1312-creating-cross-language-python-transforms)) and use them in one multi-language pipeline. For example, a pipeline that reads from a Python source processes the data using a Java transform, and writes the data to a Python sink is a multi-language pipeline. + +For example, the [Apache Kafka connector](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/kafka.py) and [SQL transform](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/sql.py) from the Java SDK can be used in Python pipelines. Review Comment: ```suggestion For example, you can use the [Apache Kafka connector](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/kafka.py) and [SQL transform](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/sql.py) from the Java SDK in Python pipelines. ``` ########## learning/prompts/documentation-lookup/16_advanced_pipeline_lifecycle.md: ########## @@ -0,0 +1,36 @@ +Prompt: +What is a pipeline development lifecycle in Apache Beam? +Response: + +The Apache Beam pipeline development lifecycle is an iterative process that usually involves the following steps: + +- Design your pipeline. +- Develop your pipeline code. +- Test your pipeline. +- Deploy your pipeline. + +On each iteration, you may need to go back and forth between the different steps to refine your pipeline code and fix any bugs you find. + +Designing a pipeline addresses the following questions: Review Comment: ```suggestion To design a pipeline, you need answers to the following questions: ``` ########## learning/prompts/documentation-lookup/13_advanced_splittable_dofn.md: ########## @@ -0,0 +1,18 @@ +Prompt: +What is Splittable DoFn in Apache Beam? +Response: +Splittable DoFn (SDF) is a generalization of [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that lets you process elements in a non-monolithic way. Splittable DoFn makes creating complex, modular I/O connectors in Beam easier. +When you apply a splittable DoFn to an element, the runner can split the element’s processing into smaller tasks. You can checkpoint the processing of an element, and you can split the remaining work to yield additional parallelism. + +At a high level, an SDF is responsible for processing element and restriction pairs. A restriction represents a subset of work that would have been necessary to have been done when processing the element. 
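For illustration, a minimal Python sketch of this element-and-restriction pairing might look like the following. The `(filename, size)` element type and the class names are hypothetical; the sketch assumes the Python SDK's `RestrictionProvider`, `OffsetRange`, and `OffsetRestrictionTracker` APIs.

```python
import apache_beam as beam
from apache_beam.io.restriction_trackers import OffsetRange, OffsetRestrictionTracker
from apache_beam.transforms.core import RestrictionProvider

class WholeFileRestrictionProvider(RestrictionProvider):
    # Pair each element with a restriction covering the whole file.
    def initial_restriction(self, element):
        filename, size = element  # hypothetical (filename, size) element
        return OffsetRange(0, size)

    def create_tracker(self, restriction):
        return OffsetRestrictionTracker(restriction)

    def restriction_size(self, element, restriction):
        return restriction.size()

class ProcessFileFn(beam.DoFn):
    def process(
        self,
        element,
        restriction_tracker=beam.DoFn.RestrictionParam(WholeFileRestrictionProvider())):
        filename, _ = element
        position = restriction_tracker.current_restriction().start
        # Claim offsets one at a time; the runner can split the unclaimed
        # remainder of the range into new element/restriction pairs.
        while restriction_tracker.try_claim(position):
            yield (filename, position)  # stand-in for reading a record at this offset
            position += 1
```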
+ +Executing an [Splittable DoFn](https://beam.apache.org/documentation/programming-guide/#splittable-dofns) follows the following steps: Review Comment: ```suggestion Executing a [Splittable `DoFn`](https://beam.apache.org/documentation/programming-guide/#splittable-dofns) uses the following steps: ``` ########## learning/prompts/documentation-lookup/15_advanced_xlang.md: ########## @@ -0,0 +1,15 @@ +Prompt: +What is a multi-language pipeline in Apache Beam? +Response: +Beam lets you combine transforms written in any supported SDK language (currently, [Java](https://beam.apache.org/documentation/programming-guide/#1311-creating-cross-language-java-transforms) and [Python](https://beam.apache.org/documentation/programming-guide/#1312-creating-cross-language-python-transforms)) and use them in one multi-language pipeline. For example, a pipeline that reads from a Python source processes the data using a Java transform, and writes the data to a Python sink is a multi-language pipeline. + +For example, the [Apache Kafka connector](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/kafka.py) and [SQL transform](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/sql.py) from the Java SDK can be used in Python pipelines. + +See quickstart examples for [Java](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines) and [Python](https://beam.apache.org/documentation/sdks/python-multi-language-pipelines) to learn how to create a multi-language pipeline. + +Depending on the SDK language of the pipeline, you can use a high-level SDK-wrapper class, or a low-level transform class to access a cross-language transform. See [Using cross-language transforms](https://beam.apache.org/documentation/programming-guide/#use-x-lang-transforms) section of Apache Beam Documentation. + +Developing a cross-language transform involves defining a Uniform Resourse Name(URN) for registering the transform with an expansion service. See [Defining a URN](https://beam.apache.org/documentation/programming-guide/#1314-defining-a-urn) for additional information and examples. Review Comment: ```suggestion To develop a cross-language transform, you need to define a Uniform Resource Name (URN) for registering the transform with an expansion service. For more information, see [Defining a URN](https://beam.apache.org/documentation/programming-guide/#1314-defining-a-urn). ``` ########## learning/prompts/documentation-lookup/20_io_biguery.md: ########## @@ -0,0 +1,39 @@ +Prompt: +Is BigQuery supported in Apache Beam? +Response: +[BigQuery](https://cloud.google.com/bigquery) is a[ Google Cloud](https://cloud.google.com/) serverless and cost-effective enterprise data warehouse. Apache Beam provides a BigQueryIO connector to read and write data from and to BigQuery. BigQueryIO supports both batch and streaming pipelines. 
+ +BigQueryIO is supported in the following Beam SDKs: +* [Java](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html) +* [Python](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.bigquery.html) +* [Go](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/bigqueryio) native and via [X-language](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/xlang/bigqueryio) +* [Typescript](https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/io/bigqueryio.ts) via X-language Review Comment: ```suggestion * [Typescript](https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/io/bigqueryio.ts) through X Language ``` ########## learning/prompts/documentation-lookup/15_advanced_xlang.md: ########## @@ -0,0 +1,15 @@ +Prompt: +What is a multi-language pipeline in Apache Beam? +Response: +Beam lets you combine transforms written in any supported SDK language (currently, [Java](https://beam.apache.org/documentation/programming-guide/#1311-creating-cross-language-java-transforms) and [Python](https://beam.apache.org/documentation/programming-guide/#1312-creating-cross-language-python-transforms)) and use them in one multi-language pipeline. For example, a pipeline that reads from a Python source processes the data using a Java transform, and writes the data to a Python sink is a multi-language pipeline. + +For example, the [Apache Kafka connector](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/kafka.py) and [SQL transform](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/sql.py) from the Java SDK can be used in Python pipelines. + +See quickstart examples for [Java](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines) and [Python](https://beam.apache.org/documentation/sdks/python-multi-language-pipelines) to learn how to create a multi-language pipeline. Review Comment: ```suggestion To learn how to create a multi-language pipeline, see the quickstart examples for [Java](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines) and [Python](https://beam.apache.org/documentation/sdks/python-multi-language-pipelines). ``` ########## learning/prompts/documentation-lookup/19_io_pubsub.md: ########## @@ -0,0 +1,25 @@ +Prompt: +Is PubSub supported in Apache Beam? +Response: +[PubSub](https://cloud.google.com/pubsub) is a[ Google Cloud](https://cloud.google.com/) service that provides a simple, reliable, scalable, and secure real-time messaging service for sending and receiving messages between independent applications. Apache Beam provides a PubSubIO connector that allows you to read and write messages from and to PubSub. +PubSub is currently supported only in streaming pipelines. + + +PubSub is supported in the following Beam SDKs: +* [Java](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.html) +* [Python](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.pubsub.html) +* [Go](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/pubsubio) +* [Typescript](https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/io/pubsub.ts) via X-language + +[Dataflow-cookbook repository](https://github.com/GoogleCloudPlatform/dataflow-cookbook) will help you to get started with PubSub and Apache Beam. 
See here for [read](https://github.com/GoogleCloudPlatform/dataflow-cookbook/blob/main/Python/pubsub/read_pubsub_multiple.py) and [write](https://github.com/GoogleCloudPlatform/dataflow-cookbook/blob/main/Python/pubsub/write_pubsub.py) examples in Python. Review Comment: ```suggestion To get started with Pub/Sub and Apache Beam, see the [Dataflow-cookbook repository](https://github.com/GoogleCloudPlatform/dataflow-cookbook) in GitHub. For Python read examples, see the [read_pubsub_multiple.py](https://github.com/GoogleCloudPlatform/dataflow-cookbook/blob/main/Python/pubsub/read_pubsub_multiple.py) example. For Python write examples, see the [write_pubsub.py](https://github.com/GoogleCloudPlatform/dataflow-cookbook/blob/main/Python/pubsub/write_pubsub.py) example. ``` ########## learning/prompts/documentation-lookup/15_advanced_xlang.md: ########## @@ -0,0 +1,15 @@ +Prompt: +What is a multi-language pipeline in Apache Beam? +Response: +Beam lets you combine transforms written in any supported SDK language (currently, [Java](https://beam.apache.org/documentation/programming-guide/#1311-creating-cross-language-java-transforms) and [Python](https://beam.apache.org/documentation/programming-guide/#1312-creating-cross-language-python-transforms)) and use them in one multi-language pipeline. For example, a pipeline that reads from a Python source processes the data using a Java transform, and writes the data to a Python sink is a multi-language pipeline. + +For example, the [Apache Kafka connector](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/kafka.py) and [SQL transform](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/sql.py) from the Java SDK can be used in Python pipelines. + +See quickstart examples for [Java](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines) and [Python](https://beam.apache.org/documentation/sdks/python-multi-language-pipelines) to learn how to create a multi-language pipeline. + +Depending on the SDK language of the pipeline, you can use a high-level SDK-wrapper class, or a low-level transform class to access a cross-language transform. See [Using cross-language transforms](https://beam.apache.org/documentation/programming-guide/#use-x-lang-transforms) section of Apache Beam Documentation. Review Comment: ```suggestion Depending on the SDK language of the pipeline, you can use a high-level SDK-wrapper class or a low-level transform class to access a cross-language transform. For more information, see [Using cross-language transforms](https://beam.apache.org/documentation/programming-guide/#use-x-lang-transforms).
``` ########## learning/prompts/documentation-lookup/20_io_biguery.md: ########## @@ -0,0 +1,39 @@ +Prompt: +Is BigQuery supported in Apache Beam? +Response: +[BigQuery](https://cloud.google.com/bigquery) is a[ Google Cloud](https://cloud.google.com/) serverless and cost-effective enterprise data warehouse. Apache Beam provides a BigQueryIO connector to read and write data from and to BigQuery. BigQueryIO supports both batch and streaming pipelines. + +BigQueryIO is supported in the following Beam SDKs: +* [Java](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html) +* [Python](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.bigquery.html) +* [Go](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/bigqueryio) native and via [X-language](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/xlang/bigqueryio) +* [Typescript](https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/io/bigqueryio.ts) via X-language + +`ReadFromBigQuery` is used to read data from BigQuery. Data can be read from a BigQuery table or using a SQL query. The default mode is to return table rows read from a BigQuery source as dictionaries. Native `TableRow` objects can also be returned if desired. + +Reading from BigQuery in its simplest form could be something like: + +```python +from apache_beam.io.gcp.bigquery import ReadFromBigQuery + +with beam.Pipeline(options=options) as p: + # read from a table + lines_table = p | 'Read' >> ReadFromBigQuery(table=table) + # read from a query + lines_query = p | 'Read' >> ReadFromBigQuery(query="SELECT * FROM table") + +``` +Writing to BigQuery in its simplest form could be something like: + +```python +from apache_beam.io.gcp.bigquery import WriteToBigQuery + +with beam.Pipeline(options=options) as p: + # write to a table + p | 'Write' >> beam.io.WriteToBigQuery( + table, + schema=TABLE_SCHEMA, + create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED, + write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND) +``` +[Dataflow-cookbook repository](https://github.com/GoogleCloudPlatform/dataflow-cookbook) will help you to get started with BigQuery and Apache Beam. See here for [read](https://github.com/GoogleCloudPlatform/dataflow-cookbook/blob/main/Python/bigquery/read_table_bigquery.py) and [write](https://github.com/GoogleCloudPlatform/dataflow-cookbook/blob/main/Python/bigquery/write_bigquery.py) examples in Python. Review Comment: ```suggestion To get started with BigQuery and Apache Beam, see the examples in the [Dataflow-cookbook repository](https://github.com/GoogleCloudPlatform/dataflow-cookbook) in GitHub. For Python read examples, see the [read_table_bigquery.py](https://github.com/GoogleCloudPlatform/dataflow-cookbook/blob/main/Python/bigquery/read_table_bigquery.py) example. For Python write examples, see the [write_bigquery.py](https://github.com/GoogleCloudPlatform/dataflow-cookbook/blob/main/Python/bigquery/write_bigquery.py) example. ``` ########## learning/prompts/documentation-lookup/20_io_biguery.md: ########## @@ -0,0 +1,39 @@ +Prompt: +Is BigQuery supported in Apache Beam? +Response: +[BigQuery](https://cloud.google.com/bigquery) is a[ Google Cloud](https://cloud.google.com/) serverless and cost-effective enterprise data warehouse. Apache Beam provides a BigQueryIO connector to read and write data from and to BigQuery. BigQueryIO supports both batch and streaming pipelines. 
+ +BigQueryIO is supported in the following Beam SDKs: +* [Java](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html) +* [Python](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.bigquery.html) +* [Go](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/bigqueryio) native and via [X-language](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/xlang/bigqueryio) +* [Typescript](https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/io/bigqueryio.ts) via X-language + +`ReadFromBigQuery` is used to read data from BigQuery. Data can be read from a BigQuery table or using a SQL query. The default mode is to return table rows read from a BigQuery source as dictionaries. Native `TableRow` objects can also be returned if desired. + +Reading from BigQuery in its simplest form could be something like: + +```python +from apache_beam.io.gcp.bigquery import ReadFromBigQuery + +with beam.Pipeline(options=options) as p: + # read from a table + lines_table = p | 'Read' >> ReadFromBigQuery(table=table) + # read from a query + lines_query = p | 'Read' >> ReadFromBigQuery(query="SELECT * FROM table") + +``` +Writing to BigQuery in its simplest form could be something like: Review Comment: ```suggestion Apache Beam pipeline code for writing to BigQuery might look like the following example: ``` ########## learning/prompts/documentation-lookup/16_advanced_pipeline_lifecycle.md: ########## @@ -0,0 +1,36 @@ +Prompt: +What is a pipeline development lifecycle in Apache Beam? +Response: + +The Apache Beam pipeline development lifecycle is an iterative process that usually involves the following steps: + +- Design your pipeline. +- Develop your pipeline code. +- Test your pipeline. +- Deploy your pipeline. + +On each iteration, you may need to go back and forth between the different steps to refine your pipeline code and fix any bugs you find. + +Designing a pipeline addresses the following questions: +- Where is my data stored? Review Comment: ```suggestion - Where is the data stored? ``` ########## learning/prompts/documentation-lookup/20_io_biguery.md: ########## @@ -0,0 +1,39 @@ +Prompt: +Is BigQuery supported in Apache Beam? +Response: +[BigQuery](https://cloud.google.com/bigquery) is a[ Google Cloud](https://cloud.google.com/) serverless and cost-effective enterprise data warehouse. Apache Beam provides a BigQueryIO connector to read and write data from and to BigQuery. BigQueryIO supports both batch and streaming pipelines. + +BigQueryIO is supported in the following Beam SDKs: +* [Java](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html) +* [Python](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.bigquery.html) +* [Go](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/bigqueryio) native and via [X-language](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/xlang/bigqueryio) +* [Typescript](https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/io/bigqueryio.ts) via X-language + +`ReadFromBigQuery` is used to read data from BigQuery. Data can be read from a BigQuery table or using a SQL query. The default mode is to return table rows read from a BigQuery source as dictionaries. Native `TableRow` objects can also be returned if desired. 
+ +Reading from BigQuery in its simplest form could be something like: Review Comment: ```suggestion Apache Beam pipeline code for reading from BigQuery might look like the following example: ``` ########## learning/prompts/documentation-lookup/20_io_biguery.md: ########## @@ -0,0 +1,39 @@ +Prompt: +Is BigQuery supported in Apache Beam? +Response: +[BigQuery](https://cloud.google.com/bigquery) is a[ Google Cloud](https://cloud.google.com/) serverless and cost-effective enterprise data warehouse. Apache Beam provides a BigQueryIO connector to read and write data from and to BigQuery. BigQueryIO supports both batch and streaming pipelines. + +BigQueryIO is supported in the following Beam SDKs: +* [Java](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html) +* [Python](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.bigquery.html) +* [Go](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/bigqueryio) native and via [X-language](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/xlang/bigqueryio) +* [Typescript](https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/io/bigqueryio.ts) via X-language + +`ReadFromBigQuery` is used to read data from BigQuery. Data can be read from a BigQuery table or using a SQL query. The default mode is to return table rows read from a BigQuery source as dictionaries. Native `TableRow` objects can also be returned if desired. Review Comment: ```suggestion To read data from BigQuery, use `ReadFromBigQuery`. Apache Beam can read data directly from a BigQuery table or using a SQL query. The default mode is to return table rows read from a BigQuery source as dictionaries. Built-in `TableRow` objects can also be returned. ``` ########## learning/prompts/documentation-lookup/16_advanced_pipeline_lifecycle.md: ########## @@ -0,0 +1,36 @@ +Prompt: +What is a pipeline development lifecycle in Apache Beam? +Response: + +The Apache Beam pipeline development lifecycle is an iterative process that usually involves the following steps: + +- Design your pipeline. +- Develop your pipeline code. +- Test your pipeline. +- Deploy your pipeline. + +On each iteration, you may need to go back and forth between the different steps to refine your pipeline code and fix any bugs you find. + +Designing a pipeline addresses the following questions: +- Where is my data stored? +- What does your data look like? +- What do you want to do with your data? +- What does your output data look like, and where should it go? + +Beam documentation has more information on [pipeline design](https://beam.apache.org/documentation/pipelines/design-your-pipeline/) and common [pipeline patterns](https://beam.apache.org/documentation/patterns/overview/). + + +Apache Beam program expresses a data processing pipeline, from start to finish. To construct a pipeline using the classes in the Beam SDKs, your program will need to perform the following general steps: + +- Create a Pipeline object +- Use a Read or Create transform to create one or more PCollections for your pipeline data +- Apply transforms to each PCollection +- Write or otherwise output the final, transformed PCollections Review Comment: ```suggestion - Write or otherwise output the final, transformed `PCollection` objects. ``` ########## learning/prompts/documentation-lookup/16_advanced_pipeline_lifecycle.md: ########## @@ -0,0 +1,36 @@ +Prompt: +What is a pipeline development lifecycle in Apache Beam? 
+Response: + +The Apache Beam pipeline development lifecycle is an iterative process that usually involves the following steps: + +- Design your pipeline. +- Develop your pipeline code. +- Test your pipeline. +- Deploy your pipeline. + +On each iteration, you may need to go back and forth between the different steps to refine your pipeline code and fix any bugs you find. + +Designing a pipeline addresses the following questions: +- Where is my data stored? +- What does your data look like? +- What do you want to do with your data? +- What does your output data look like, and where should it go? + +Beam documentation has more information on [pipeline design](https://beam.apache.org/documentation/pipelines/design-your-pipeline/) and common [pipeline patterns](https://beam.apache.org/documentation/patterns/overview/). + + +Apache Beam program expresses a data processing pipeline, from start to finish. To construct a pipeline using the classes in the Beam SDKs, your program will need to perform the following general steps: + +- Create a Pipeline object Review Comment: ```suggestion - Create a `Pipeline` object. ``` ########## learning/prompts/documentation-lookup/16_advanced_pipeline_lifecycle.md: ########## @@ -0,0 +1,36 @@ +Prompt: +What is a pipeline development lifecycle in Apache Beam? +Response: + +The Apache Beam pipeline development lifecycle is an iterative process that usually involves the following steps: + +- Design your pipeline. +- Develop your pipeline code. +- Test your pipeline. +- Deploy your pipeline. + +On each iteration, you may need to go back and forth between the different steps to refine your pipeline code and fix any bugs you find. + +Designing a pipeline addresses the following questions: +- Where is my data stored? +- What does your data look like? +- What do you want to do with your data? +- What does your output data look like, and where should it go? + +Beam documentation has more information on [pipeline design](https://beam.apache.org/documentation/pipelines/design-your-pipeline/) and common [pipeline patterns](https://beam.apache.org/documentation/patterns/overview/). + + +Apache Beam program expresses a data processing pipeline, from start to finish. To construct a pipeline using the classes in the Beam SDKs, your program will need to perform the following general steps: Review Comment: ```suggestion An Apache Beam program expresses a data processing pipeline, from start to finish. To construct a pipeline using the classes in the Apache Beam SDKs, your program needs to perform the following steps: ``` ########## learning/prompts/documentation-lookup/13_advanced_splittable_dofn.md: ########## @@ -0,0 +1,18 @@ +Prompt: +What is Splittable DoFn in Apache Beam? +Response: +Splittable DoFn (SDF) is a generalization of [DoFn](https://beam.apache.org/documentation/programming-guide/#pardo) that lets you process elements in a non-monolithic way. Splittable DoFn makes creating complex, modular I/O connectors in Beam easier. +When you apply a splittable DoFn to an element, the runner can split the element’s processing into smaller tasks. You can checkpoint the processing of an element, and you can split the remaining work to yield additional parallelism. Review Comment: ```suggestion When you apply a splittable `DoFn` to an element, the runner can split the element’s processing into smaller tasks. You can checkpoint the processing of an element, and you can split the remaining work to yield additional parallelism. 
``` ########## learning/prompts/documentation-lookup/16_advanced_pipeline_lifecycle.md: ########## @@ -0,0 +1,36 @@ +Prompt: +What is a pipeline development lifecycle in Apache Beam? +Response: + +The Apache Beam pipeline development lifecycle is an iterative process that usually involves the following steps: + +- Design your pipeline. +- Develop your pipeline code. +- Test your pipeline. +- Deploy your pipeline. + +On each iteration, you may need to go back and forth between the different steps to refine your pipeline code and fix any bugs you find. + +Designing a pipeline addresses the following questions: +- Where is my data stored? +- What does your data look like? +- What do you want to do with your data? +- What does your output data look like, and where should it go? + +Beam documentation has more information on [pipeline design](https://beam.apache.org/documentation/pipelines/design-your-pipeline/) and common [pipeline patterns](https://beam.apache.org/documentation/patterns/overview/). + + +Apache Beam program expresses a data processing pipeline, from start to finish. To construct a pipeline using the classes in the Beam SDKs, your program will need to perform the following general steps: + +- Create a Pipeline object +- Use a Read or Create transform to create one or more PCollections for your pipeline data +- Apply transforms to each PCollection +- Write or otherwise output the final, transformed PCollections +- Run the pipeline Review Comment: ```suggestion - Run the pipeline. ``` ########## learning/prompts/documentation-lookup/16_advanced_pipeline_lifecycle.md: ########## @@ -0,0 +1,36 @@ +Prompt: +What is a pipeline development lifecycle in Apache Beam? +Response: + +The Apache Beam pipeline development lifecycle is an iterative process that usually involves the following steps: + +- Design your pipeline. +- Develop your pipeline code. +- Test your pipeline. +- Deploy your pipeline. + +On each iteration, you may need to go back and forth between the different steps to refine your pipeline code and fix any bugs you find. + +Designing a pipeline addresses the following questions: +- Where is my data stored? +- What does your data look like? +- What do you want to do with your data? +- What does your output data look like, and where should it go? + +Beam documentation has more information on [pipeline design](https://beam.apache.org/documentation/pipelines/design-your-pipeline/) and common [pipeline patterns](https://beam.apache.org/documentation/patterns/overview/). + + +Apache Beam program expresses a data processing pipeline, from start to finish. To construct a pipeline using the classes in the Beam SDKs, your program will need to perform the following general steps: + +- Create a Pipeline object +- Use a Read or Create transform to create one or more PCollections for your pipeline data Review Comment: ```suggestion - Use a `Read` or `Create` transform to create one or more `PCollection` objects for your pipeline data. ``` ########## learning/prompts/documentation-lookup/19_io_pubsub.md: ########## @@ -0,0 +1,25 @@ +Prompt: +Is PubSub supported in Apache Beam? +Response: +[PubSub](https://cloud.google.com/pubsub) is a[ Google Cloud](https://cloud.google.com/) service that provides a simple, reliable, scalable, and secure real-time messaging service for sending and receiving messages between independent applications. Apache Beam provides a PubSubIO connector that allows you to read and write messages from and to PubSub. 
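For illustration, a minimal streaming Pub/Sub round trip with the Python SDK might look like the following sketch. The topic names are placeholders, and `ReadFromPubSub` emits raw bytes by default.

```python
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub, WriteToPubSub
from apache_beam.options.pipeline_options import PipelineOptions

# Pub/Sub I/O requires a streaming pipeline; topic names are placeholders.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> ReadFromPubSub(topic='projects/my-project/topics/input')
     | 'Decode' >> beam.Map(lambda data: data.decode('utf-8'))   # messages arrive as bytes
     | 'Transform' >> beam.Map(str.upper)
     | 'Encode' >> beam.Map(lambda text: text.encode('utf-8'))   # WriteToPubSub expects bytes
     | 'Write' >> WriteToPubSub(topic='projects/my-project/topics/output'))
```

Setting `streaming=True` matters because, as the file notes next, Pub/Sub is supported only in streaming pipelines.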
+PubSub is currently supported only in streaming pipelines. + + +PubSub is supported in the following Beam SDKs: +* [Java](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.html) +* [Python](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.pubsub.html) +* [Go](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/pubsubio) +* [Typescript](https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/io/pubsub.ts) via X-language Review Comment: ```suggestion * [Typescript](https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/io/pubsub.ts) through X Language. ``` ########## learning/prompts/documentation-lookup/16_advanced_pipeline_lifecycle.md: ########## @@ -0,0 +1,36 @@ +Prompt: +What is a pipeline development lifecycle in Apache Beam? +Response: + +The Apache Beam pipeline development lifecycle is an iterative process that usually involves the following steps: + +- Design your pipeline. +- Develop your pipeline code. +- Test your pipeline. +- Deploy your pipeline. + +On each iteration, you may need to go back and forth between the different steps to refine your pipeline code and fix any bugs you find. + +Designing a pipeline addresses the following questions: +- Where is my data stored? +- What does your data look like? +- What do you want to do with your data? +- What does your output data look like, and where should it go? + +Beam documentation has more information on [pipeline design](https://beam.apache.org/documentation/pipelines/design-your-pipeline/) and common [pipeline patterns](https://beam.apache.org/documentation/patterns/overview/). + + +Apache Beam program expresses a data processing pipeline, from start to finish. To construct a pipeline using the classes in the Beam SDKs, your program will need to perform the following general steps: + +- Create a Pipeline object +- Use a Read or Create transform to create one or more PCollections for your pipeline data +- Apply transforms to each PCollection +- Write or otherwise output the final, transformed PCollections +- Run the pipeline + +Beam documentation has more on [developing](https://beam.apache.org/documentation/programming-guide/) and [executing](https://beam.apache.org/documentation/pipelines/create-your-pipeline/) pipelines. + +Testing pipelines is a particularly important step in developing an effective data processing solution. The indirect nature of the Beam model, in which your user code constructs a pipeline graph to be executed remotely, can make debugging-failed runs a non-trivial task. See [here](https://beam.apache.org/documentation/pipelines/test-your-pipeline/) for more information on pipeline testing strategies. + +Choosing a [runner](https://beam.apache.org/documentation/#choosing-a-runner) is a crucial step in deploying your pipeline. The runner you choose determines where and how your pipeline will execute. Review Comment: ```suggestion Choosing a [runner](https://beam.apache.org/documentation/#choosing-a-runner) is a crucial step in deploying your pipeline. The runner you choose determines where and how your pipeline executes. ``` ########## learning/prompts/documentation-lookup/17_advanced_ai_ml.md: ########## @@ -0,0 +1,18 @@ +Prompt: +What are AI and ML capabilities in Apache Beam? +Response: +Apache Beam has several built-in [AI and ML capabilities](https://beam.apache.org/documentation/ml/overview/) that enable you to: +- Process large datasets for both preprocessing and model inference. 
+- Conduct exploratory data analysis and smoothly scale up data pipelines in production as part of your MLOps ecosystem. +- Run your models in production with varying data loads, both in batch and streaming + +See [here](https://beam.apache.org/documentation/patterns/ai-platform/) for common AI Platform integration patterns in Apache Beam. Review Comment: ```suggestion For common AI platform integration patterns in Apache Beam, see [AI Platform integration patterns](https://beam.apache.org/documentation/patterns/ai-platform/). ``` ########## learning/prompts/documentation-lookup/16_advanced_pipeline_lifecycle.md: ########## @@ -0,0 +1,36 @@ +Prompt: +What is a pipeline development lifecycle in Apache Beam? +Response: + +The Apache Beam pipeline development lifecycle is an iterative process that usually involves the following steps: + +- Design your pipeline. +- Develop your pipeline code. +- Test your pipeline. +- Deploy your pipeline. + +On each iteration, you may need to go back and forth between the different steps to refine your pipeline code and fix any bugs you find. + +Designing a pipeline addresses the following questions: +- Where is my data stored? +- What does your data look like? +- What do you want to do with your data? +- What does your output data look like, and where should it go? + +Beam documentation has more information on [pipeline design](https://beam.apache.org/documentation/pipelines/design-your-pipeline/) and common [pipeline patterns](https://beam.apache.org/documentation/patterns/overview/). + + +Apache Beam program expresses a data processing pipeline, from start to finish. To construct a pipeline using the classes in the Beam SDKs, your program will need to perform the following general steps: + +- Create a Pipeline object +- Use a Read or Create transform to create one or more PCollections for your pipeline data +- Apply transforms to each PCollection +- Write or otherwise output the final, transformed PCollections +- Run the pipeline + +Beam documentation has more on [developing](https://beam.apache.org/documentation/programming-guide/) and [executing](https://beam.apache.org/documentation/pipelines/create-your-pipeline/) pipelines. + +Testing pipelines is a particularly important step in developing an effective data processing solution. The indirect nature of the Beam model, in which your user code constructs a pipeline graph to be executed remotely, can make debugging-failed runs a non-trivial task. See [here](https://beam.apache.org/documentation/pipelines/test-your-pipeline/) for more information on pipeline testing strategies. Review Comment: ```suggestion Testing pipelines is a particularly important step in developing an effective data processing solution. The indirect nature of the Beam model, in which your user code constructs a pipeline graph to be executed remotely, can make debugging failed runs difficult. For more information about pipeline testing strategies, see [Test Your Pipeline](https://beam.apache.org/documentation/pipelines/test-your-pipeline/). ``` ########## learning/prompts/documentation-lookup/16_advanced_pipeline_lifecycle.md: ########## @@ -0,0 +1,36 @@ +Prompt: +What is a pipeline development lifecycle in Apache Beam? +Response: + +The Apache Beam pipeline development lifecycle is an iterative process that usually involves the following steps: + +- Design your pipeline. +- Develop your pipeline code. +- Test your pipeline. +- Deploy your pipeline. 
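To make the construction steps listed in this file concrete, a minimal runnable Python pipeline might look like the following sketch; the output path is a placeholder.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Exiting the `with` block runs the pipeline on the configured runner.
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Create' >> beam.Create(['hello', 'beam'])       # build a PCollection
     | 'Upper' >> beam.Map(str.upper)                   # apply a transform
     | 'Write' >> beam.io.WriteToText('/tmp/output'))   # output the result
```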
+ +On each iteration, you may need to go back and forth between the different steps to refine your pipeline code and fix any bugs you find. + +Designing a pipeline addresses the following questions: +- Where is my data stored? +- What does your data look like? +- What do you want to do with your data? +- What does your output data look like, and where should it go? + +Beam documentation has more information on [pipeline design](https://beam.apache.org/documentation/pipelines/design-your-pipeline/) and common [pipeline patterns](https://beam.apache.org/documentation/patterns/overview/). + + +Apache Beam program expresses a data processing pipeline, from start to finish. To construct a pipeline using the classes in the Beam SDKs, your program will need to perform the following general steps: + +- Create a Pipeline object +- Use a Read or Create transform to create one or more PCollections for your pipeline data +- Apply transforms to each PCollection +- Write or otherwise output the final, transformed PCollections +- Run the pipeline + +Beam documentation has more on [developing](https://beam.apache.org/documentation/programming-guide/) and [executing](https://beam.apache.org/documentation/pipelines/create-your-pipeline/) pipelines. + +Testing pipelines is a particularly important step in developing an effective data processing solution. The indirect nature of the Beam model, in which your user code constructs a pipeline graph to be executed remotely, can make debugging-failed runs a non-trivial task. See [here](https://beam.apache.org/documentation/pipelines/test-your-pipeline/) for more information on pipeline testing strategies. + +Choosing a [runner](https://beam.apache.org/documentation/#choosing-a-runner) is a crucial step in deploying your pipeline. The runner you choose determines where and how your pipeline will execute. +More information on deployment is available [here](https://beam.apache.org/documentation/runtime/environments/). Review Comment: ```suggestion For more information about pipeline deployment, see [Container environments](https://beam.apache.org/documentation/runtime/environments/). ``` ########## learning/prompts/documentation-lookup/17_advanced_ai_ml.md: ########## @@ -0,0 +1,18 @@ +Prompt: +What are AI and ML capabilities in Apache Beam? +Response: +Apache Beam has several built-in [AI and ML capabilities](https://beam.apache.org/documentation/ml/overview/) that enable you to: +- Process large datasets for both preprocessing and model inference. +- Conduct exploratory data analysis and smoothly scale up data pipelines in production as part of your MLOps ecosystem. +- Run your models in production with varying data loads, both in batch and streaming Review Comment: ```suggestion - Run your models in production with varying data loads, both in batch and streaming pipelines. ``` ########## learning/prompts/documentation-lookup/17_advanced_ai_ml.md: ########## @@ -0,0 +1,18 @@ +Prompt: +What are AI and ML capabilities in Apache Beam? +Response: +Apache Beam has several built-in [AI and ML capabilities](https://beam.apache.org/documentation/ml/overview/) that enable you to: +- Process large datasets for both preprocessing and model inference. +- Conduct exploratory data analysis and smoothly scale up data pipelines in production as part of your MLOps ecosystem. 
+- Run your models in production with varying data loads, both in batch and streaming + +See [here](https://beam.apache.org/documentation/patterns/ai-platform/) for common AI Platform integration patterns in Apache Beam. + +The recommended way to implement inference in Apache Beam is by using the [RunInference API](https://beam.apache.org/documentation/sdks/python-machine-learning/). See [here](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_pytorch_tensorflow_sklearn.ipynb) for more details on how to use RunInference for PyTorch, scikit-learn, and TensorFlow. + +Using pre-trained models in Apache Beam is also supported with [PyTorch](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_pytorch.ipynb), [Scikit-learn](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_sklearn.ipynb), and [Tensorflow](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_tensorflow.ipynb). Running inference on [custom models](https://beam.apache.org/documentation/ml/about-ml/#use-custom-models) is also supported. + +Apache Beam also supports automatic model refresh, which allows you to update models, hot-swapping them in a running streaming pipeline with no pause in processing the data stream, avoiding downtime. See [here](https://beam.apache.org/documentation/ml/about-ml/#automatic-model-refresh) for more details. +More on Apache Beam ML innovations for production can be found [here](https://cloud.google.com/blog/products/ai-machine-learning/dataflow-ml-innovations-on-apache-beam/). Review Comment: ```suggestion For more information about using machine learning models with Apache Beam, see [Running ML models now easier with new Dataflow ML innovations on Apache Beam](https://cloud.google.com/blog/products/ai-machine-learning/dataflow-ml-innovations-on-apache-beam/). ``` ########## learning/prompts/documentation-lookup/17_advanced_ai_ml.md: ########## @@ -0,0 +1,18 @@ +Prompt: +What are AI and ML capabilities in Apache Beam? +Response: +Apache Beam has several built-in [AI and ML capabilities](https://beam.apache.org/documentation/ml/overview/) that enable you to: +- Process large datasets for both preprocessing and model inference. +- Conduct exploratory data analysis and smoothly scale up data pipelines in production as part of your MLOps ecosystem. +- Run your models in production with varying data loads, both in batch and streaming + +See [here](https://beam.apache.org/documentation/patterns/ai-platform/) for common AI Platform integration patterns in Apache Beam. + +The recommended way to implement inference in Apache Beam is by using the [RunInference API](https://beam.apache.org/documentation/sdks/python-machine-learning/). See [here](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_pytorch_tensorflow_sklearn.ipynb) for more details on how to use RunInference for PyTorch, scikit-learn, and TensorFlow. + +Using pre-trained models in Apache Beam is also supported with [PyTorch](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_pytorch.ipynb), [Scikit-learn](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_sklearn.ipynb), and [Tensorflow](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_tensorflow.ipynb). Running inference on [custom models](https://beam.apache.org/documentation/ml/about-ml/#use-custom-models) is also supported. 
+
+Apache Beam also supports automatic model refresh, which allows you to update models, hot-swapping them in a running streaming pipeline with no pause in processing the data stream, avoiding downtime. See [here](https://beam.apache.org/documentation/ml/about-ml/#automatic-model-refresh) for more details.
+More on Apache Beam ML innovations for production can be found [here](https://cloud.google.com/blog/products/ai-machine-learning/dataflow-ml-innovations-on-apache-beam/).
+
+For more hands-on examples of using Apache Beam ML integration, see [here](https://beam.apache.org/documentation/patterns/bqml/)

Review Comment:
```suggestion
For an example that uses the Apache Beam ML integration, see [BigQuery ML integration](https://beam.apache.org/documentation/patterns/bqml/).
```

########## learning/prompts/documentation-lookup/17_advanced_ai_ml.md: ##########
@@ -0,0 +1,18 @@
+Prompt:
+What are AI and ML capabilities in Apache Beam?
+Response:
+Apache Beam has several built-in [AI and ML capabilities](https://beam.apache.org/documentation/ml/overview/) that enable you to:
+- Process large datasets for both preprocessing and model inference.
+- Conduct exploratory data analysis and smoothly scale up data pipelines in production as part of your MLOps ecosystem.
+- Run your models in production with varying data loads, both in batch and streaming
+
+See [here](https://beam.apache.org/documentation/patterns/ai-platform/) for common AI Platform integration patterns in Apache Beam.
+
+The recommended way to implement inference in Apache Beam is by using the [RunInference API](https://beam.apache.org/documentation/sdks/python-machine-learning/). See [here](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_pytorch_tensorflow_sklearn.ipynb) for more details on how to use RunInference for PyTorch, scikit-learn, and TensorFlow.
+
+Using pre-trained models in Apache Beam is also supported with [PyTorch](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_pytorch.ipynb), [Scikit-learn](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_sklearn.ipynb), and [Tensorflow](https://github.com/apache/beam/blob/master/examples/notebooks/beam-ml/run_inference_tensorflow.ipynb). Running inference on [custom models](https://beam.apache.org/documentation/ml/about-ml/#use-custom-models) is also supported.
+
+Apache Beam also supports automatic model refresh, which allows you to update models, hot-swapping them in a running streaming pipeline with no pause in processing the data stream, avoiding downtime. See [here](https://beam.apache.org/documentation/ml/about-ml/#automatic-model-refresh) for more details.

Review Comment:
```suggestion
Apache Beam also supports automatically updating the model being used with the `RunInference` `PTransform` in streaming pipelines without stopping the pipeline. The feature lets you avoid downtime. For more information, see [Automatic model refresh](https://beam.apache.org/documentation/ml/about-ml/#automatic-model-refresh).
```

########## learning/prompts/documentation-lookup/19_io_pubsub.md: ##########
@@ -0,0 +1,25 @@
+Prompt:
+Is PubSub supported in Apache Beam?
+Response:
+[PubSub](https://cloud.google.com/pubsub) is a[ Google Cloud](https://cloud.google.com/) service that provides a simple, reliable, scalable, and secure real-time messaging service for sending and receiving messages between independent applications.
Apache Beam provides a PubSubIO connector that allows you to read and write messages from and to PubSub.
+PubSub is currently supported only in streaming pipelines.

Review Comment:
```suggestion
Pub/Sub integrations are supported only in streaming pipelines.
```

########## learning/prompts/documentation-lookup/19_io_pubsub.md: ##########
@@ -0,0 +1,25 @@
+Prompt:
+Is PubSub supported in Apache Beam?
+Response:
+[PubSub](https://cloud.google.com/pubsub) is a[ Google Cloud](https://cloud.google.com/) service that provides a simple, reliable, scalable, and secure real-time messaging service for sending and receiving messages between independent applications. Apache Beam provides a PubSubIO connector that allows you to read and write messages from and to PubSub.
+PubSub is currently supported only in streaming pipelines.
+
+
+PubSub is supported in the following Beam SDKs:

Review Comment:
```suggestion
The following Apache Beam SDKs support Pub/Sub:
```

########## learning/prompts/documentation-lookup/20_io_biguery.md: ##########
@@ -0,0 +1,39 @@
+Prompt:
+Is BigQuery supported in Apache Beam?
+Response:
+[BigQuery](https://cloud.google.com/bigquery) is a[ Google Cloud](https://cloud.google.com/) serverless and cost-effective enterprise data warehouse. Apache Beam provides a BigQueryIO connector to read and write data from and to BigQuery. BigQueryIO supports both batch and streaming pipelines.
+
+BigQueryIO is supported in the following Beam SDKs:
+* [Java](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html)
+* [Python](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.bigquery.html)
+* [Go](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/bigqueryio) native and via [X-language](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/xlang/bigqueryio)

Review Comment:
```suggestion
* [Go](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/bigqueryio) native and through [X Language](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/xlang/bigqueryio)
```

########## learning/prompts/documentation-lookup/20_io_biguery.md: ##########
@@ -0,0 +1,39 @@
+Prompt:
+Is BigQuery supported in Apache Beam?
+Response:
+[BigQuery](https://cloud.google.com/bigquery) is a[ Google Cloud](https://cloud.google.com/) serverless and cost-effective enterprise data warehouse. Apache Beam provides a BigQueryIO connector to read and write data from and to BigQuery. BigQueryIO supports both batch and streaming pipelines.

Review Comment:
```suggestion
Yes, Apache Beam supports BigQuery. [BigQuery](https://cloud.google.com/bigquery) is a serverless and cost-effective enterprise data warehouse offered by [Google Cloud](https://cloud.google.com/). Apache Beam provides a `BigQueryIO` connector to read and write data to and from BigQuery. The `BigQueryIO` connector supports both batch and streaming pipelines.
```

########## learning/prompts/documentation-lookup/19_io_pubsub.md: ##########
@@ -0,0 +1,25 @@
+Prompt:
+Is PubSub supported in Apache Beam?
+Response:
+[PubSub](https://cloud.google.com/pubsub) is a[ Google Cloud](https://cloud.google.com/) service that provides a simple, reliable, scalable, and secure real-time messaging service for sending and receiving messages between independent applications. Apache Beam provides a PubSubIO connector that allows you to read and write messages from and to PubSub.
+PubSub is currently supported only in streaming pipelines.
+
+
+PubSub is supported in the following Beam SDKs:
+* [Java](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.html)
+* [Python](https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.pubsub.html)
+* [Go](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam/io/pubsubio)
+* [Typescript](https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/io/pubsub.ts) via X-language
+
+[Dataflow-cookbook repository](https://github.com/GoogleCloudPlatform/dataflow-cookbook) will help you to get started with PubSub and Apache Beam. See here for [read](https://github.com/GoogleCloudPlatform/dataflow-cookbook/blob/main/Python/pubsub/read_pubsub_multiple.py) and [write](https://github.com/GoogleCloudPlatform/dataflow-cookbook/blob/main/Python/pubsub/write_pubsub.py) examples in Python.
+
+```python

Review Comment:
We should introduce this code with a description of what it is and why it's here.
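As the comment above suggests, the truncated `python` fence in the quoted diff needs an introduction. For illustration only, the kind of snippet the linked cookbook examples describe is a streaming read/write against Pub/Sub; a minimal sketch follows, where the project, topic, and subscription names are placeholders rather than values from the PR:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Pub/Sub reads and writes require a streaming pipeline.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # ReadFromPubSub yields raw message payloads as bytes.
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/your-project/subscriptions/your-subscription"
        )
        | "Decode" >> beam.Map(lambda payload: payload.decode("utf-8"))
        # ... transform the messages here ...
        | "Encode" >> beam.Map(lambda text: text.encode("utf-8"))
        # WriteToPubSub expects bytes.
        | "Write" >> beam.io.WriteToPubSub(
            topic="projects/your-project/topics/your-topic"
        )
    )
```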
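Similarly, the five pipeline-construction steps quoted earlier from 16_advanced_pipeline_lifecycle.md map one-to-one onto Python SDK code. A minimal word-count sketch of those steps, with the input data inlined for illustration:

```python
import apache_beam as beam

# 1. Create a Pipeline object.
with beam.Pipeline() as pipeline:
    # 2. Use a Create transform to build an initial PCollection.
    lines = pipeline | "Create" >> beam.Create(["to be or not to be"])
    # 3. Apply transforms to each PCollection.
    counts = (
        lines
        | "Split" >> beam.FlatMap(str.split)
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "SumPerWord" >> beam.CombinePerKey(sum)
    )
    # 4. Output the final, transformed PCollection.
    counts | "Print" >> beam.Map(print)
    # 5. The pipeline runs when the `with` block exits.
```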
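The RunInference discussion quoted from 17_advanced_ai_ml.md could likewise carry a short example. A sketch using the scikit-learn model handler, assuming a pickled model at a placeholder path:

```python
import numpy
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

# The model URI is a placeholder; any pickled scikit-learn model works.
model_handler = SklearnModelHandlerNumpy(model_uri="gs://your-bucket/model.pkl")

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Examples" >> beam.Create([numpy.array([1.0, 2.0]), numpy.array([3.0, 4.0])])
        # RunInference pairs each example with the model's prediction.
        | "Infer" >> RunInference(model_handler)
        | "Print" >> beam.Map(print)
    )
```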
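And for the `BigQueryIO` connector described in 20_io_biguery.md, a read/write sketch in the Python SDK. The destination table, dataset, and schema are placeholders, and an actual run also needs a Cloud Storage temp location for BigQuery exports; only the public `bigquery-public-data.samples.shakespeare` table is real:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        # Rows arrive as dictionaries keyed by column name.
        | "Read" >> beam.io.ReadFromBigQuery(
            query="SELECT word, word_count FROM `bigquery-public-data.samples.shakespeare`",
            use_standard_sql=True,
        )
        | "Write" >> beam.io.WriteToBigQuery(
            "your-project:your_dataset.word_counts",
            schema="word:STRING,word_count:INTEGER",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```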
