alxp1982 commented on code in PR #23085:
URL: https://github.com/apache/beam/pull/23085#discussion_r966541256
##########
learning/tour-of-beam/learning-content/java/introduction/introduction-concepts/creating-collections/from-memory/example/ParDoExample.java:
##########
@@ -0,0 +1,82 @@
+/*

Review Comment:
   Please change the name of the file and task. In this learning context, we're not showing ParDo but rather PCollection creation from in-memory data.

##########
learning/tour-of-beam/learning-content/java/introduction/introduction-concepts/basic-concepts/description.md:
##########
@@ -0,0 +1,134 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# Tour of Beam Programming Guide

Review Comment:
   I think we need to split this unit. It is too long and overloaded. Let's do the following:
   - Creating a pipeline
   - Setting PipelineOptions
   - Creating custom options

##########
learning/tour-of-beam/learning-content/java/introduction/introduction-concepts/basic-concepts/example/Task.java:
##########
@@ -0,0 +1,69 @@
+/*

Review Comment:
   The unit itself was focused on creating and configuring the pipeline. The example needs to include the learned concepts.

##########
learning/tour-of-beam/learning-content/java/introduction/introduction-concepts/basic-concepts/description.md:
##########
@@ -0,0 +1,134 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# Tour of Beam Programming Guide
+
+The Beam Programming Guide is intended for Beam users who want to use the Beam SDKs to create data processing pipelines. This guide provides guidance for using the Beam SDK classes to build and test pipelines. The programming guide is not intended to be an exhaustive reference, but rather a language-agnostic, high-level guide to programmatically building your Beam pipeline. As the programming guide is filled out, the text will include code samples in multiple languages to help illustrate how to implement Beam concepts in your pipelines.

Review Comment:
   The Tour of Beam Programming Guide section isn't needed.

##########
learning/tour-of-beam/learning-content/java/introduction/introduction-concepts/basic-concepts/description.md:
##########
@@ -0,0 +1,134 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# Tour of Beam Programming Guide
+
+The Beam Programming Guide is intended for Beam users who want to use the Beam SDKs to create data processing pipelines. This guide provides guidance for using the Beam SDK classes to build and test pipelines. The programming guide is not intended to be an exhaustive reference, but rather a language-agnostic, high-level guide to programmatically building your Beam pipeline. As the programming guide is filled out, the text will include code samples in multiple languages to help illustrate how to implement Beam concepts in your pipelines.
+
+For a brief introduction to Beam’s basic concepts, take a look at the Basics of the Beam model page before reading the programming guide.
+
+### Overview
+
+To use Beam, you need to first create a driver program using the classes in one of the Beam SDKs. Your driver program defines your pipeline, including all of the inputs, transforms, and outputs; it also sets execution options for your pipeline (typically passed by using command-line options). These include the Pipeline Runner, which, in turn, determines what back-end your pipeline will run on.
+
+The Beam SDKs provide a number of abstractions that simplify the mechanics of large-scale distributed data processing. The same Beam abstractions work with both batch and streaming data sources. When you create your Beam pipeline, you can think about your data processing task in terms of these abstractions. They include:
+
+→ `Pipeline`: A Pipeline encapsulates your entire data processing task, from start to finish. This includes reading input data, transforming that data, and writing output data. All Beam driver programs must create a Pipeline. When you create the Pipeline, you must also specify the execution options that tell the Pipeline where and how to run.
+
+→ `PCollection`: A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism. Your pipeline typically creates an initial PCollection by reading data from an external data source, but you can also create a PCollection from in-memory data within your driver program. From there, PCollections are the inputs and outputs for each step in your pipeline.
+
+→ `PTransform`: A PTransform represents a data processing operation, or a step, in your pipeline. Every PTransform takes one or more PCollection objects as input, performs a processing function that you provide on the elements of that PCollection, and produces zero or more output PCollection objects.
+
+→ `Scope`: The Go SDK has an explicit scope variable used to build a Pipeline. A Pipeline can return its root scope with the `Root()` method. The scope variable is passed to PTransform functions to place them in the Pipeline that owns the Scope.
+
+→ `I/O transforms`: Beam comes with a number of “IOs” - library PTransforms that read or write data to various external storage systems.
+
+A typical Beam driver program works as follows:
+
+→ Create a Pipeline object and set the pipeline execution options, including the Pipeline Runner.
+
+→ Create an initial PCollection for pipeline data, either using the IOs to read data from an external storage system, or using a Create transform to build a PCollection from in-memory data.
+
+→ Apply PTransforms to each PCollection. Transforms can change, filter, group, analyze, or otherwise process the elements in a PCollection. A transform creates a new output PCollection without modifying the input collection. A typical pipeline applies subsequent transforms to each new output PCollection in turn until processing is complete. However, note that a pipeline does not have to be a single straight line of transforms applied one after another: think of PCollections as variables and PTransforms as functions applied to these variables; the shape of the pipeline can be an arbitrarily complex processing graph.
+
+→ Use IOs to write the final, transformed PCollection(s) to an external sink.
+
+→ Run the pipeline using the designated Pipeline Runner.
+
+When you run your Beam driver program, the Pipeline Runner that you designate constructs a workflow graph of your pipeline based on the PCollection objects you’ve created and transforms that you’ve applied. That graph is then executed using the appropriate distributed processing back-end, becoming an asynchronous “job” (or equivalent) on that back-end.
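To make these steps concrete, here is a minimal sketch of a driver program that follows them. It is illustrative only: it builds an initial PCollection from two in-memory strings with Create, applies a single MapElements transform, writes the result with TextIO, and runs the pipeline with default options. The class name and output path are placeholders.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalDriverProgram {
  public static void main(String[] args) {
    // 1. Create the pipeline and its execution options (defaults here;
    //    reading options from the command line is covered below).
    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(options);

    // 2. Create an initial PCollection, in this case from in-memory data.
    PCollection<String> words = p.apply(Create.of("hello", "beam"));

    // 3. Apply a transform; it produces a new PCollection instead of
    //    modifying its input.
    PCollection<String> upper =
        words.apply(MapElements.into(TypeDescriptors.strings())
            .via((String word) -> word.toUpperCase()));

    // 4. Write the final PCollection to an external sink (placeholder path).
    upper.apply(TextIO.write().to("/tmp/output"));

    // 5. Run the pipeline on the designated runner and wait for completion.
    p.run().waitUntilFinish();
  }
}
```

Run with the Direct runner, this typically writes sharded text files based on the placeholder path; the same code runs unchanged on other runners once the corresponding options are set.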
+### Creating a pipeline
+
+The `Pipeline` abstraction encapsulates all the data and steps in your data processing task. Your Beam driver program typically starts by constructing a `Pipeline` object, and then using that object as the basis for creating the pipeline’s data sets as `PCollection`s and its operations as `Transforms`.
+
+To use Beam, your driver program must first create an instance of the Beam SDK class `Pipeline` (typically in the main() function). When you create your `Pipeline`, you’ll also need to set some configuration options. You can set your pipeline’s configuration options programmatically, but it’s often easier to set the options ahead of time (or read them from the command line) and pass them to the `Pipeline` object when you create the object.
+
+```
+// Start by defining the options for the pipeline.
+PipelineOptions options = PipelineOptionsFactory.create();
+
+// Then create the pipeline.
+Pipeline p = Pipeline.create(options);
+```
+
+### Configuring pipeline options
+
+Use the pipeline options to configure different aspects of your pipeline, such as the pipeline runner that will execute your pipeline and any runner-specific configuration required by the chosen runner. Your pipeline options will potentially include information such as your project ID or a location for storing files.
+
+When you run the pipeline on a runner of your choice, a copy of the PipelineOptions will be available to your code. For example, if you add a PipelineOptions parameter to a DoFn’s `@ProcessElement` method, it will be populated by the system.
+
+### Setting PipelineOptions from command-line arguments
+
+While you can configure your pipeline by creating a PipelineOptions object and setting the fields directly, the Beam SDKs include a command-line parser that you can use to set fields in PipelineOptions using command-line arguments.
+
+To read options from the command line, construct your PipelineOptions object as demonstrated in the following example code:
+
+```
+PipelineOptions options =
+    PipelineOptionsFactory.fromArgs(args).withValidation().create();
+```
+
+This interprets command-line arguments that follow the format:
+
+```
+--<option>=<value>
+```
+
+> Appending the method .withValidation will check for required command-line arguments and validate argument values.
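As a small illustration of the earlier point that a copy of the PipelineOptions is available to your code at runtime, the sketch below declares a PipelineOptions parameter in a DoFn’s `@ProcessElement` method; the system populates it when the pipeline executes. The DoFn name and its output format are made up for the example.

```java
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.transforms.DoFn;

// Illustrative DoFn: the PipelineOptions parameter is populated by the
// system at execution time, so options can be read inside the transform.
class TagWithJobNameFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c, PipelineOptions options) {
    // getJobName() is a standard option; any registered option could be read here.
    c.output(c.element() + " (job: " + options.getJobName() + ")");
  }
}
```

Such a DoFn is applied like any other, for example with `ParDo.of(new TagWithJobNameFn())`.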
+### Creating custom options
+
+You can add your own custom options in addition to the standard `PipelineOptions`.
+
+To add your own options, define an interface with getter and setter methods for each option.
+
+The following example shows how to add `input` and `output` custom options:
+
+```
+public interface MyOptions extends PipelineOptions {
+    String getInput();
+    void setInput(String input);
+
+    String getOutput();
+    void setOutput(String output);
+}
+```
+
+You can also specify a description, which appears when a user passes `--help` as a command-line argument, and a default value.
+
+You set the description and default value using annotations, as follows:
+
+```
+public interface MyOptions extends PipelineOptions {
+    @Description("Input for the pipeline")
+    @Default.String("gs://my-bucket/input")
+    String getInput();
+    void setInput(String input);
+
+    @Description("Output for the pipeline")
+    @Default.String("gs://my-bucket/output")
+    String getOutput();
+    void setOutput(String output);
+}
+```
+
+It’s recommended that you register your interface with `PipelineOptionsFactory` and then pass the interface when creating the `PipelineOptions` object. When you register your interface with `PipelineOptionsFactory`, the `--help` command can find your custom options interface and add it to its output. `PipelineOptionsFactory` will also validate that your custom options are compatible with all other registered options.
+
+The following example code shows how to register your custom options interface with `PipelineOptionsFactory`:
+
+```
+PipelineOptionsFactory.register(MyOptions.class);
+MyOptions options = PipelineOptionsFactory.fromArgs(args)
+    .withValidation()
+    .as(MyOptions.class);
+```

Review Comment:
   Please add a description of the task, an invitation to run and experiment with it, as well as a few small challenges, similarly to how it is done in other places, e.g. 'In the playground window, you can find an example of ... that you can run and experiment with. Can you modify it so that ...'

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
