melap commented on a change in pull request #15720:
URL: https://github.com/apache/beam/pull/15720#discussion_r733218178
##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -40,102 +42,171 @@ transforms, there are some special features worth highlighting.
### Pipeline
-A pipeline in Beam is a graph of PTransforms operating on PCollections. A
-pipeline is constructed by a user in their SDK of choice, and makes its way to
-your runner either via the SDK directly or via the Runner API's
-RPC interfaces.
+A Beam pipeline is a directed acyclic graph of all the data and computations in
+your data processing task. This includes reading input data, transforming that
+data, and writing output data. A pipeline is constructed by a user in their SDK
+of choice. Then, the pipeline makes its way to the runner either through the SDK
+directly or through the Runner API's RPC interface. For example, this diagram
+shows a branching pipeline:
+
+
+
+In the diagram, the boxes are parallel transformations called _PTransforms_ and
+the arrows with the circles represent the data (in the form of _PCollections_)
+that flows between the transforms. The data might be bounded, stored data sets,
+or the data might also be unbounded streams of data. In Beam, most transforms
+apply equally to bounded and unbounded data.
+
+You can express almost any computation that you can think of as a graph as a
+Beam pipeline. A Beam driver program typically starts by creating a `Pipeline`
+object, and then uses that object as the basis for creating the pipeline’s data
+sets and its transforms.
+
+For more information about pipelines, see the following pages:
+
+ * [Beam Programming Guide: Overview](/documentation/programming-guide/#overview)
+ * [Beam Programming Guide: Creating a pipeline](/documentation/programming-guide/#creating-a-pipeline)
+ * [Design your pipeline](/documentation/pipelines/design-your-pipeline)
+ * [Create your pipeline](/documentation/pipelines/create-your-pipeline)
### PTransforms
-A `PTransform` represents a data processing operation, or a step,
-in your pipeline. A `PTransform` can be applied to one or more
-`PCollection` objects as input which performs some processing on the elements of that
-`PCollection` and produces zero or more output `PCollection` objects.
+A `PTransform` (or transform) represents a data processing operation, or a step,
+in your pipeline. A transform can be applied to one or more input `PCollection`
+objects. You provide processing logic in the form of a function object
+(colloquially referred to as “user code”), and your user code is applied to each
+element of the input PCollection (or more than one PCollection). Depending on
+the pipeline runner and backend that you choose, many different workers across a
+cluster might execute instances of your user code in parallel. The user code
+that runs on each worker generates the output elements that are added to the
+zero or more output `PCollection` objects.
+
+The Beam SDKs contain a number of different transforms that you can apply to
+your pipeline’s PCollections. These include general-purpose core transforms,
+such as `ParDo` or `Combine`. There are also pre-written composite transforms
+included in the SDKs, which combine one or more of the core transforms in a
+useful processing pattern, such as counting or combining elements in a
+collection. You can also define your own more complex composite transforms to
+fit your pipeline’s exact use case.
+
+The following list has some common transform types:
+
+ * Root transforms such as `TextIO.Read` and `Create`. A root transform
Review comment:
Done - changed to "source"
##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -40,102 +42,171 @@ transforms, there are some special features worth highlighting.
### Pipeline
-A pipeline in Beam is a graph of PTransforms operating on PCollections. A
-pipeline is constructed by a user in their SDK of choice, and makes its way to
-your runner either via the SDK directly or via the Runner API's
-RPC interfaces.
+A Beam pipeline is a directed acyclic graph of all the data and computations in
+your data processing task. This includes reading input data, transforming that
+data, and writing output data. A pipeline is constructed by a user in their SDK
+of choice. Then, the pipeline makes its way to the runner either through the SDK
+directly or through the Runner API's RPC interface. For example, this diagram
+shows a branching pipeline:
+
+
+
+In the diagram, the boxes are parallel transformations called _PTransforms_ and
+the arrows with the circles represent the data (in the form of _PCollections_)
+that flows between the transforms. The data might be bounded, stored data sets,
+or the data might also be unbounded streams of data. In Beam, most transforms
+apply equally to bounded and unbounded data.
+
+You can express almost any computation that you can think of as a graph as a
+Beam pipeline. A Beam driver program typically starts by creating a `Pipeline`
+object, and then uses that object as the basis for creating the pipeline’s data
+sets and its transforms.
+
+For more information about pipelines, see the following pages:
+
 * [Beam Programming Guide: Overview](/documentation/programming-guide/#overview)
 * [Beam Programming Guide: Creating a pipeline](/documentation/programming-guide/#creating-a-pipeline)
+ * [Design your pipeline](/documentation/pipelines/design-your-pipeline)
+ * [Create your pipeline](/documentation/pipelines/create-your-pipeline)
### PTransforms
-A `PTransform` represents a data processing operation, or a step,
-in your pipeline. A `PTransform` can be applied to one or more
-`PCollection` objects as input which performs some processing on the elements of that
-`PCollection` and produces zero or more output `PCollection` objects.
+A `PTransform` (or transform) represents a data processing operation, or a step,
+in your pipeline. A transform can be applied to one or more input `PCollection`
+objects. You provide processing logic in the form of a function object
+(colloquially referred to as “user code”), and your user code is applied to each
+element of the input PCollection (or more than one PCollection). Depending on
+the pipeline runner and backend that you choose, many different workers across a
+cluster might execute instances of your user code in parallel. The user code
+that runs on each worker generates the output elements that are added to the
+zero or more output `PCollection` objects.
+
+The Beam SDKs contain a number of different transforms that you can apply to
+your pipeline’s PCollections. These include general-purpose core transforms,
+such as `ParDo` or `Combine`. There are also pre-written composite transforms
+included in the SDKs, which combine one or more of the core transforms in a
+useful processing pattern, such as counting or combining elements in a
+collection. You can also define your own more complex composite transforms to
+fit your pipeline’s exact use case.
+
+The following list has some common transform types:
+
+ * Root transforms such as `TextIO.Read` and `Create`. A root transform
+ conceptually has no input.
+ * Processing and conversion operations such as `ParDo`, `GroupByKey`,
+ `CoGroupByKey`, `Combine`, and `Count`.
+ * Outputting transforms like `TextIO.Write`.
+ * User-defined, application-specific composite transforms.
+
+For more information about transforms, see the following pages:
+
 * [Beam Programming Guide: Overview](/documentation/programming-guide/#overview)
 * [Beam Programming Guide: Transforms](/documentation/programming-guide/#transforms)
+ * Beam transform catalog ([Java](/documentation/transforms/java/overview/),
+ [Python](/documentation/transforms/python/overview/))
### PCollections
-A PCollection is an unordered bag of elements. Your runner will be responsible
Review comment:
Done - rewrote this paragraph and put it back
##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -40,102 +42,171 @@ transforms, there are some special features worth highlighting.
### Pipeline
-A pipeline in Beam is a graph of PTransforms operating on PCollections. A
-pipeline is constructed by a user in their SDK of choice, and makes its way to
-your runner either via the SDK directly or via the Runner API's
-RPC interfaces.
+A Beam pipeline is a directed acyclic graph of all the data and computations in
Review comment:
Hmm, I do state that when I explain the diagram just below this, though
I wasn't linking to the sections, so added the links. Do you think that is
sufficient?
##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -22,11 +22,13 @@ of operations. You want to integrate it with the Beam ecosystem to get access
to other languages, great event time processing, and a library of connectors.
You need to know the core vocabulary:
- * [_Pipeline_](#pipeline) - A pipeline is a graph of transformations that a user constructs
-   that defines the data processing they want to do.
- * [_PCollection_](#pcollections) - Data being processed in a pipeline is part of a PCollection.
- * [_PTransforms_](#ptransforms) - The operations executed within a pipeline. These are best
-   thought of as operations on PCollections.
+ * [_Pipeline_](#pipeline) - A pipeline is a user-constructed graph of
+ transformations that defines the desired data processing operations.
+ * [_PCollection_](#pcollections) - A `PCollection` is a data set or data
+ stream. The data that a pipeline processes is part of a PCollection.
+ * [_PTransforms_](#ptransforms) - A `PTransform` (or _transform_) represents a
+ data processing operation, or a step, in your pipeline. A transform can be
+ applied to one or more `PCollection` objects.
Review comment:
Done - made some tweaks to this bullet item and also made some changes
to the main PCollection section around this.
##########
File path: website/www/site/content/en/documentation/basics.md
##########
@@ -40,102 +42,171 @@ transforms, there are some special features worth highlighting.
### Pipeline
-A pipeline in Beam is a graph of PTransforms operating on PCollections. A
-pipeline is constructed by a user in their SDK of choice, and makes its way to
-your runner either via the SDK directly or via the Runner API's
-RPC interfaces.
+A Beam pipeline is a directed acyclic graph of all the data and computations in
+your data processing task. This includes reading input data, transforming that
+data, and writing output data. A pipeline is constructed by a user in their SDK
+of choice. Then, the pipeline makes its way to the runner either through the SDK
+directly or through the Runner API's RPC interface. For example, this diagram
+shows a branching pipeline:
+
+
+
+In the diagram, the boxes are parallel transformations called _PTransforms_ and
+the arrows with the circles represent the data (in the form of _PCollections_)
+that flows between the transforms. The data might be bounded, stored data sets,
+or the data might also be unbounded streams of data. In Beam, most transforms
+apply equally to bounded and unbounded data.
+
+You can express almost any computation that you can think of as a graph as a
+Beam pipeline. A Beam driver program typically starts by creating a `Pipeline`
+object, and then uses that object as the basis for creating the pipeline’s data
+sets and its transforms.
+
+For more information about pipelines, see the following pages:
+
+ * [Beam Programming Guide: Overview](/documentation/programming-guide/#overview)
+ * [Beam Programming Guide: Creating a pipeline](/documentation/programming-guide/#creating-a-pipeline)
+ * [Design your pipeline](/documentation/pipelines/design-your-pipeline)
+ * [Create your pipeline](/documentation/pipelines/create-your-pipeline)
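
To make the driver-program flow described above concrete, here is a plain-Python sketch of a branching pipeline. This is not the Beam SDK API; all function and variable names are illustrative stand-ins that mimic how one input collection feeds two independent transforms, as in the diagram.

```python
# Plain-Python sketch (NOT the Beam SDK) of a branching pipeline:
# a root "transform" produces the initial collection, and two
# independent branches each consume that same collection.

def create(elements):
    """Root transform analogue: produces the initial collection."""
    return list(elements)

def apply_map(collection, fn):
    """Element-wise transform analogue: applies fn to each element."""
    return [fn(e) for e in collection]

lines = create(["apple", "banana", "avocado"])

# Branch 1: uppercase every element.
branch_a = apply_map(lines, str.upper)

# Branch 2: keep only elements starting with "a".
branch_b = [w for w in lines if w.startswith("a")]

print(branch_a)  # ['APPLE', 'BANANA', 'AVOCADO']
print(branch_b)  # ['apple', 'avocado']
```

In an actual Beam driver program, the same shape is expressed by applying PTransforms to a `Pipeline` object; the point here is only that both branches read from the same intermediate collection, which is what makes the graph branch.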
### PTransforms
-A `PTransform` represents a data processing operation, or a step,
-in your pipeline. A `PTransform` can be applied to one or more
-`PCollection` objects as input which performs some processing on the elements of that
-`PCollection` and produces zero or more output `PCollection` objects.
+A `PTransform` (or transform) represents a data processing operation, or a step,
+in your pipeline. A transform can be applied to one or more input `PCollection`
+objects. You provide processing logic in the form of a function object
+(colloquially referred to as “user code”), and your user code is applied to each
+element of the input PCollection (or more than one PCollection). Depending on
+the pipeline runner and backend that you choose, many different workers across a
+cluster might execute instances of your user code in parallel. The user code
+that runs on each worker generates the output elements that are added to the
+zero or more output `PCollection` objects.
+
+The Beam SDKs contain a number of different transforms that you can apply to
+your pipeline’s PCollections. These include general-purpose core transforms,
+such as `ParDo` or `Combine`. There are also pre-written composite transforms
+included in the SDKs, which combine one or more of the core transforms in a
+useful processing pattern, such as counting or combining elements in a
+collection. You can also define your own more complex composite transforms to
+fit your pipeline’s exact use case.
+
+The following list has some common transform types:
+
+ * Root transforms such as `TextIO.Read` and `Create`. A root transform
+ conceptually has no input.
+ * Processing and conversion operations such as `ParDo`, `GroupByKey`,
+ `CoGroupByKey`, `Combine`, and `Count`.
+ * Outputting transforms like `TextIO.Write`.
+ * User-defined, application-specific composite transforms.
+
+For more information about transforms, see the following pages:
+
+ * [Beam Programming Guide: Overview](/documentation/programming-guide/#overview)
+ * [Beam Programming Guide: Transforms](/documentation/programming-guide/#transforms)
+ * Beam transform catalog ([Java](/documentation/transforms/java/overview/),
+ [Python](/documentation/transforms/python/overview/))
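
The "user code runs once per element and emits zero or more outputs" idea behind `ParDo` can be sketched in plain Python. This is not the Beam SDK; `par_do` and `split_words` are illustrative names modeling the semantics only.

```python
# Plain-Python sketch (NOT the Beam SDK) of the ParDo semantics:
# user code runs once per input element and may emit zero or more
# output elements, which are collected into the output collection.

def par_do(collection, do_fn):
    """Apply user code to every element; flatten all emitted outputs."""
    out = []
    for element in collection:
        out.extend(do_fn(element))  # do_fn yields 0..n outputs per element
    return out

def split_words(line):
    """User code: emits one output per word, nothing for an empty line."""
    yield from line.split()

words = par_do(["hello world", "", "beam basics"], split_words)
print(words)  # ['hello', 'world', 'beam', 'basics']
```

The empty input line producing no output is the key property: unlike a plain map, a ParDo-style transform has no fixed one-to-one relationship between input and output elements.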
### PCollections
-A PCollection is an unordered bag of elements. Your runner will be responsible
-for storing these elements. There are some major aspects of a PCollection to
-note:
+Beam pipelines process PCollections. A `PCollection` is a potentially
+distributed, homogeneous data set or data stream, and is owned by the specific
+`Pipeline` object for which it is created. Multiple pipelines cannot share a
+`PCollection`. The runner is responsible for storing these elements.
+
+A PCollection generally contains "big data" (too much data to fit in memory on a
+single machine). Sometimes a small sample of data or an intermediate result
+might fit into memory on a single machine, but Beam's computational patterns and
+transforms are focused on situations where distributed data-parallel computation
+is required. Therefore, the elements of a `PCollection` cannot be processed
+individually, and are instead processed uniformly in parallel.
-#### Bounded vs Unbounded
+There are some major aspects of a PCollection to note:
-A PCollection may be bounded or unbounded.
+#### Bounded vs unbounded
- - _Bounded_ - it is finite and you know it, as in batch use cases
- - _Unbounded_ - it may be never end, you don't know, as in streaming use cases
+A `PCollection` can be either bounded or unbounded.
-These derive from the intuitions of batch and stream processing, but the two
-are unified in Beam and bounded and unbounded PCollections can coexist in the
-same pipeline. If your runner can only support bounded PCollections, you'll
-need to reject pipelines that contain unbounded PCollections. If your
-runner is only really targeting streams, there are adapters in our support code
-to convert everything to APIs targeting unbounded data.
+ - _Bounded_ - A bounded `PCollection` is a dataset of a known, fixed size
+ (alternatively, a dataset that is not growing over time). Bounded data can
+ be processed by batch pipelines.
+ - _Unbounded_ - An unbounded PCollection is a dataset that grows over time,
+ with elements processed as they arrive. Unbounded data must be processed by
+ streaming pipelines.
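
The bounded/unbounded distinction above can be sketched in plain Python (not the Beam SDK): a bounded data set behaves like a finite list, while an unbounded stream behaves like a generator whose elements are processed as they arrive. The `sensor_stream` name is illustrative.

```python
# Plain-Python sketch (NOT the Beam SDK) of bounded vs unbounded data.
import itertools

# Bounded: a data set of known, fixed size; batch-style processing
# can see the whole collection at once.
bounded = [1, 2, 3]
total = sum(x * x for x in bounded)
print(total)  # 14

def sensor_stream():
    """Stands in for an unbounded source; conceptually it never ends."""
    n = 0
    while True:
        yield n
        n += 1

# Streaming-style processing consumes elements as they arrive; here
# we take only the first five so the sketch terminates.
first_five = list(itertools.islice(sensor_stream(), 5))
print(first_five)  # [0, 1, 2, 3, 4]
```

A batch computation like the `sum` above cannot be applied as-is to the unbounded source, which is why unbounded PCollections require streaming pipelines (and, in Beam, windowing to scope aggregations).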
Review comment:
Done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]