Re: [PR] Proposed edits for Beam YAML overview [beam]

via GitHub Thu, 04 Apr 2024 15:30:31 -0700


robertwb commented on code in PR #30842:
URL: https://github.com/apache/beam/pull/30842#discussion_r1552524970



##########
website/www/site/content/en/documentation/sdks/yaml.md:
##########
@@ -23,80 +23,132 @@ title: "Apache Beam YAML API"
 
 # Beam YAML API
 
-While Beam provides powerful APIs for authoring sophisticated data
-processing pipelines, it often still has too high a barrier for
-getting started and authoring simple pipelines. Even setting up the
-environment, installing the dependencies, and setting up the project
-can be an overwhelming amount of boilerplate for some (though
-https://beam.apache.org/blog/beam-starter-projects/ has gone a long
-way in making this easier).
-
-Here we provide a simple declarative syntax for describing pipelines
-that does not require coding experience or learning how to use an
-SDK&mdash;any text editor will do.
-Some installation may be required to actually *execute* a pipeline, but
-we envision various services (such as Dataflow) to accept yaml pipelines
-directly obviating the need for even that in the future.
-We also anticipate the ability to generate code directly from these
-higher-level yaml descriptions, should one want to graduate to a full
-Beam SDK (and possibly the other direction as well as far as possible).
-
-Though we intend this syntax to be easily authored (and read) directly by
-humans, this may also prove a useful intermediate representation for
-tools to use as well, either as output (e.g. a pipeline authoring GUI)
-or consumption (e.g. a lineage analysis tool) and expect it to be more
-easily manipulated and semantically meaningful than the Beam protos
-themselves (which concern themselves more with execution).
-
-It should be noted that everything here is still under development, but any
-features already included are considered stable. Feedback is welcome at
-[email protected].
-
-## Running pipelines
-
-The Beam yaml parser is currently included as part of the Apache Beam Python SDK.
-This can be installed (e.g. within a virtual environment) as
+Beam YAML is a declarative syntax for describing Apache Beam pipelines by using
+YAML files. You can use Beam YAML to author and run a Beam pipeline without
+writing any code.
+
+## Overview
+
+Beam provides a powerful model for creating sophisticated data processing
+pipelines. However, getting started with Beam programming can be challenging
+because it requires writing code in one of the supported Beam SDK languages.
+You need to understand the APIs, set up a project, manage dependencies, and
+perform other programming tasks.
+
+Beam YAML makes it easier to get started with creating Beam pipelines. Instead
+of writing code, you create a YAML file using any text editor. Then you submit
+the YAML file to be executed by a runner.
+
+The Beam YAML syntax is designed to be human-readable but also suitable as an
+intermediate representation for tools. For example, a pipeline authoring GUI
+could output YAML, or a lineage analysis tool could consume the YAML pipeline
+specifications.
+
+Beam YAML is still under development, but any features already included are
+considered stable. Feedback is welcome at [email protected].
+
+## Prerequisites
+
+The Beam YAML parser is currently included as part of the
+[Apache Beam Python SDK](../python/). You don't need to write Python code to use
+Beam YAML, but you need the SDK to run pipelines locally.
+
+We recommend creating a
+[virtual environment](../../../get-started/quickstart/python/#create-and-activate-a-virtual-environment)
+so that all packages are installed in an isolated and self-contained
+environment. After you set up your Python environment, install the SDK as
+follows:
 
 ```
 pip install apache_beam[yaml,gcp]
 ```
 
-In addition, several of the provided transforms (such as SQL) are implemented
-in Java and their expansion will require a working Java interpeter. (The
-requisite artifacts will be automatically downloaded from the apache maven
-repositories, so no further installs will be required.)
-Docker is also currently required for local execution of these
-cross-language-requiring transforms, but not for submission to a non-local
-runner such as Flink or Dataflow.
+In addition, several of the provided transforms, such as the SQL transform, are
+implemented in Java and require a working Java interpreter. When you run a
+pipeline with these transforms, the required artifacts are automatically
+downloaded from the Apache Maven repositories. To execute these cross-language
+transforms locally, you must have Docker installed on your local machine.

Review Comment:
   The part about Docker is no longer true since
   https://github.com/apache/beam/pull/29283. (Are there other places we should
   be updating this as well?)
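
For readers following the prerequisites discussed in this hunk, here is a minimal sketch of a local check for the one external tool the proposed text still requires, a Java runtime. The helper name `missing_tools` is hypothetical and not part of Beam; it only checks whether an executable is on `PATH`, and assumes nothing about which Java version Beam needs.

```python
import shutil


def missing_tools(tools=("java",)):
    """Return the names in `tools` that cannot be found on PATH.

    Hypothetical helper for illustration; not part of the Beam SDK.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If the Docker requirement does still apply in some setup, passing `("java", "docker")` would extend the same check.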



##########
website/www/site/content/en/documentation/sdks/yaml.md:
##########
@@ -23,80 +23,132 @@ title: "Apache Beam YAML API"
 
 # Beam YAML API
 
-While Beam provides powerful APIs for authoring sophisticated data
-processing pipelines, it often still has too high a barrier for
-getting started and authoring simple pipelines. Even setting up the
-environment, installing the dependencies, and setting up the project
-can be an overwhelming amount of boilerplate for some (though
-https://beam.apache.org/blog/beam-starter-projects/ has gone a long
-way in making this easier).
-
-Here we provide a simple declarative syntax for describing pipelines
-that does not require coding experience or learning how to use an
-SDK&mdash;any text editor will do.
-Some installation may be required to actually *execute* a pipeline, but
-we envision various services (such as Dataflow) to accept yaml pipelines
-directly obviating the need for even that in the future.
-We also anticipate the ability to generate code directly from these
-higher-level yaml descriptions, should one want to graduate to a full
-Beam SDK (and possibly the other direction as well as far as possible).
-
-Though we intend this syntax to be easily authored (and read) directly by
-humans, this may also prove a useful intermediate representation for
-tools to use as well, either as output (e.g. a pipeline authoring GUI)
-or consumption (e.g. a lineage analysis tool) and expect it to be more
-easily manipulated and semantically meaningful than the Beam protos
-themselves (which concern themselves more with execution).
-
-It should be noted that everything here is still under development, but any
-features already included are considered stable. Feedback is welcome at
-[email protected].
-
-## Running pipelines
-
-The Beam yaml parser is currently included as part of the Apache Beam Python SDK.
-This can be installed (e.g. within a virtual environment) as
+Beam YAML is a declarative syntax for describing Apache Beam pipelines by using
+YAML files. You can use Beam YAML to author and run a Beam pipeline without
+writing any code.
+
+## Overview
+
+Beam provides a powerful model for creating sophisticated data processing
+pipelines. However, getting started with Beam programming can be challenging
+because it requires writing code in one of the supported Beam SDK languages.
+You need to understand the APIs, set up a project, manage dependencies, and
+perform other programming tasks.
+
+Beam YAML makes it easier to get started with creating Beam pipelines. Instead

Review Comment:
   +1, that's a good idea. 


