Polber commented on code in PR #30269:
URL: https://github.com/apache/beam/pull/30269#discussion_r1506183891
##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -1535,7 +1670,13 @@ automatically apply some optimizations:
##### 4.2.4.1. Simple combinations using simple functions {#simple-combines}
+<span class="language-yaml">
+Beam YAML has the following built-in CombineFns: count, sum, min, max,
+mean, any, all, group, and concat.
+CombineFns from other languages can also be referenced.
Review Comment:
Should there be a mention of aggregation being experimental in YAML?
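For context, usage of one of these built-in CombineFns looks something like the following sketch (transform and field names are illustrative; the `group_by`/`combine` config shape is assumed from the `Combine` transform):

```yaml
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: /path/to/sales*.csv
    # Aggregate the `amount` field per `category` using the built-in `sum`.
    - type: Combine
      config:
        group_by: category
        combine:
          amount: sum
    - type: WriteToJson
      config:
        path: /path/to/totals.json
```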
##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -46,6 +46,11 @@ The [Go SDK](https://pkg.go.dev/github.com/apache/beam/sdks/v2/go/pkg/beam) supp
The Typescript SDK supports Node v16+ and is still experimental.
{{< /paragraph >}}
+{{< paragraph class="language-yaml">}}
+Yaml is supported as of Beam 2.52, but is under active development and the most
Review Comment:
```suggestion
YAML is supported as of Beam 2.52, but is under active development and the most
```
##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -93,6 +98,11 @@ include:
* I/O transforms: Beam comes with a number of "IOs" - library `PTransform`s that
read or write data to various external storage systems.
+<span class="language-yaml">
+Note that in Beam YAML `PCollection`s are either implicit (e.g. when using
Review Comment:
nit:
```suggestion
Note that in Beam YAML, `PCollection`s are either implicit (e.g. when using
```
##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -3129,6 +3310,16 @@ Unfortunately type information in Typescript is not propagated to the runtime la
so it needs to be manually specified in some places (e.g. when using
cross-language pipelines).
{{< /paragraph >}}
+{{< paragraph class="language-yaml">}}
+In Beam YAML, all transforms produce and accept schema'd data which is used to validate the pipeline.
+{{< /paragraph >}}
+
+{{< paragraph class="language-yaml">}}
+In some cases Beam is unable to figure out the output type of a mapping function.
+In this case you can specify it manually using
Review Comment:
nit:
```suggestion
In some cases, Beam is unable to figure out the output type of a mapping function.
In this case, you can specify it manually using
```
##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -628,6 +694,47 @@ the transform itself as an argument, and the operation returns the output
[Output PCollection] = await [Input PCollection].applyAsync([AsyncTransform])
{{< /highlight >}}
+{{< highlight yaml >}}
+pipeline:
+ transforms:
+ ...
+ - name: ProducingTransform
+ type: ProducingTransformType
+ ...
+
+ - name: MyTransform
+ type: MyTransformType
+ input: ProducingTransform.output_name
+ ...
+{{< /highlight >}}
+
+{{< paragraph class="language-yaml">}}
+The `.output_name` designation can be omitted for those transforms
+with a single (non-error) output.
+{{< /paragraph >}}
Review Comment:
This seems a bit confusing - it reads like it is typical to specify the
output, when in practice this is rarely done (except error handling). What if
the logic was reversed by having the example be omitting the named output, and
then mentioning that transforms with multiple outputs can use the
`.output_name` notation (and link to error_handling doc)
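e.g. the example could read something like this (transform names here are placeholders):

```yaml
pipeline:
  transforms:
    ...
    - name: ProducingTransform
      type: ProducingTransformType
      ...

    # Typical case: a single-output producer is referenced by name alone.
    - name: MyTransform
      type: MyTransformType
      input: ProducingTransform
      ...

    # A transform with multiple outputs can select one with dot notation.
    - name: HandleErrors
      type: ErrorHandlingTransformType
      input: MyTransform.my_error_output
```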
##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -350,6 +383,11 @@ After you've created your `Pipeline`, you'll need to begin by creating at least
one `PCollection` in some form. The `PCollection` you create serves as the input
for the first operation in your pipeline.
+<span class="language-yaml">
+In Beam YAML `PCollection`s are either implicit (e.g. when using `chain`)
Review Comment:
nit:
```suggestion
In Beam YAML, `PCollection`s are either implicit (e.g. when using `chain`)
```
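For reference, the implicit (`chain`) style looks something like this sketch (the paths and the `line` field name are assumptions):

```yaml
pipeline:
  type: chain
  transforms:
    - type: ReadFromText
      config:
        path: /path/to/input*.txt
    - type: Filter
      config:
        language: python
        keep: "len(line) > 0"
    - type: WriteToText
      config:
        path: /path/to/output.txt
```

No `input` is specified anywhere; each transform implicitly consumes the output of the one before it.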
##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -796,9 +903,18 @@ input into a different format; you might also use `ParDo` to convert processed
data into a format suitable for output, like database table rows or printable
strings.
+{{< paragraph class="language-java language-go language-py language-typescript">}}
When you apply a `ParDo` transform, you'll need to provide user code in the form
of a `DoFn` object. `DoFn` is a Beam SDK class that defines a distributed
processing function.
+{{< /paragraph >}}
+
+{{< paragraph class="language-yaml">}}
+In Beam YAML `ParDo` operations are expressed by the `MapToFields`, `Filter`,
+and `Explode` transform types which may take a UDF in the language of your
+choice rather than introducing the notion of a `DoFn`.
Review Comment:
nit:
```suggestion
In Beam YAML, `ParDo` operations are expressed by the `MapToFields`, `Filter`,
and `Explode` transform types. These types can take a UDF in the language of your
choice, rather than introducing the notion of a `DoFn`.
```
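For example, a `MapToFields` with an inline Python UDF looks roughly like this (field names are illustrative):

```yaml
- type: MapToFields
  config:
    language: python
    fields:
      # New field computed by an inline Python callable over each row.
      shouted:
        callable: |
          def shout(row):
            return row.text.upper()
```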
##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -202,6 +220,12 @@ One can either construct one manually, but it is also common to pass an object
created from command line options such as `yargs.argv`.
{{< /paragraph >}}
+{{< paragraph class="language-yaml">}}
+Pipeline options are simply an optional yaml mapping property that is a sibling to
Review Comment:
nit:
```suggestion
Pipeline options are simply an optional YAML mapping property that is a sibling to
```
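i.e. something like this (the `streaming` option is just an example):

```yaml
pipeline:
  transforms:
    ...

# Sibling mapping holding the pipeline options.
options:
  streaming: true
```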
##########
website/www/site/content/en/documentation/programming-guide.md:
##########
@@ -367,19 +405,29 @@ adapters](#pipeline-io). The adapters vary in their exact usage, but all of them
read from some external data source and return a `PCollection` whose elements
represent the data records in that source.
-Each data source adapter has a `Read` transform; to read, you must apply that
-transform to the `Pipeline` object itself.
+Each data source adapter has a `Read` transform; to read,
+<span class="language-java language-py language-go language-typescript">
+you must apply that transform to the `Pipeline` object itself.
+</span>
+<span class="language-yaml">
+place this transform in the `source` or `transforms` portion of the pipeline.
+</span>
<span class="language-java">`TextIO.Read`</span>
<span class="language-py">`io.TextFileSource`</span>
<span class="language-go">`textio.Read`</span>
<span class="language-typescript">`textio.ReadFromText`</span>,
+<span class="language-yaml">`ReadFromText`</span>,
for example, reads from an
-external text file and returns a `PCollection` whose elements are of type
-`String`, each `String` represents one line from the text file. Here's how you
+external text file and returns a `PCollection` whose elements
+<span class="language-java language-py language-go language-typescript">
+are of type `String`, each `String`
Review Comment:
```suggestion
are of type `String` where each `String`
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]