dariabezkorovaina commented on code in PR #30440:
URL: https://github.com/apache/beam/pull/30440#discussion_r1514166163


##########
learning/prompts/documentation-lookup-nolinks/50_beam_yaml.md:
##########
@@ -0,0 +1,72 @@
+Prompt:
+Can I create Apache Beam pipelines using YAML?
+
+Response:
+The Beam YAML API enables the creation of basic Apache Beam pipelines using 
YAML files, which can be edited with any text editor. The API provides an 
intuitive, declarative syntax that simplifies pipeline authoring by eliminating 
the need for coding experience or SDK familiarity. Manipulating and 
understanding YAML semantics is easier compared to working directly with Beam 
protos. Additionally, YAML's readability and ease of authorship make it a 
suitable intermediate representation for tools such as pipeline authoring GUIs 
or lineage analysis tools.
+
+The Beam YAML parser is currently integrated into the Apache Beam Python SDK.
+
+The Beam YAML SDK offers a wide range of built-in transforms and allows the 
creation of custom transforms through the notion of `providers` that leverage 
expansion services and schema transforms, or by integrating Python packages 
into your pipeline. You can give names to transforms to help you monitor, 
debug, and disambiguate identical transforms, enhancing readability.
+
+Beam YAML enables authoring both linear and non-linear pipelines. For linear 
pipelines, Beam YAML allows the inputs to be implicit by specifying the 
pipeline as a `chain` type. Additionally, you can denote the first and last 
transforms as `source` and `sink`, respectively. For non-linear pipelines, 
inputs must be explicitly named, although `chains` can be nested within them.
+
+The following Beam YAML code snippet demonstrates a linear pipeline that reads 
data from a CSV file, filters rows based on a specified condition, executes an 
SQL query, and writes the resulting data to a JSON file:
+
+```yaml
+pipeline:
+  type: chain
+
+  source:
+    type: ReadFromCsv
+    config:
+      path: /path/to/input*.csv
+
+  transforms:
+    - type: Filter
+      config:
+        language: python
+        keep: "col3 > 100"
+
+    - type: Sql
+      name: MySqlTransform
+      config:
+        query: "select col1, count(*) as cnt from PCOLLECTION group by col1"
+
+  sink:
+    type: WriteToJson
+    config:
+      path: /path/to/output.json
+```
+
+With Beam YAML, you can define both streaming and batch pipelines. Beam YAML 
also supports Apache Beam’s windowing and triggering, allowing windowing to be 
declared using the standard `WindowInto` transform or by tagging a transform 
with specified windowing. The latter applies the windowing to its inputs and 
the transform itself.
+
+The following example demonstrates a streaming Beam YAML pipeline with a chain 
of transforms. It reads from a Pub/Sub topic, applies a grouping transform with 
sliding windowing, and then writes the output to another Pub/Sub topic:
+
+```yaml
+pipeline:
+  type: chain
+  transforms:
+    - type: ReadFromPubSub
+      config:
+        topic: myPubSubTopic
+        format: ...
+        schema: ...
+    - type: SomeGroupingTransform
+      config:
+        arg: ...
+      windowing:
+        type: sliding
+        size: 60s
+        period: 10s
+    - type: WriteToPubSub
+      config:
+        topic: anotherPubSubTopic
+        format: json
+```
+
+You can execute a pipeline defined in a YAML file using the standard `python 
-m` command:
+
+```python
+python -m apache_beam.yaml.main --yaml_pipeline_file=/path/to/pipeline.yaml 
[other pipeline options such as the runner]

Review Comment:
   Thank you, Danny! Noted:) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to