TheNeuralBit commented on a change in pull request #14717: URL: https://github.com/apache/beam/pull/14717#discussion_r625922517
########## File path: website/www/site/content/en/documentation/glossary.md ##########
@@ -0,0 +1,464 @@
+---
+title: "Beam glossary"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Apache Beam glossary
+
+## Aggregation
+
+A transform pattern for computing a value from multiple input elements. Aggregation is similar to the reduce operation in the [MapReduce](https://en.wikipedia.org/wiki/MapReduce) model. Aggregation transforms include Count (computes the count of all elements in the aggregation), Max (computes the maximum element in the aggregation), and Sum (computes the sum of all elements in the aggregation).
+
+For a complete list of aggregation transforms, see:
+
+* [Java Transform catalog](/documentation/transforms/java/overview/)
+* [Python Transform catalog](/documentation/transforms/python/overview/)
+
+## Apply
+
+A method for invoking a transform on a PCollection. Each transform in the Beam SDKs has a generic `apply` method (or pipe operator `|`). Invoking multiple Beam transforms is similar to method chaining, but with a difference: You apply the transform to the input PCollection, passing the transform itself as an argument, and the operation returns the output PCollection. Because of Beam’s deferred execution model, applying a transform does not immediately execute that transform.
+
+To learn more, see:
+
+* [Applying transforms](/documentation/programming-guide/#applying-transforms)
+
+## Batch processing
+
+A data processing paradigm for working with finite, or bounded, datasets. A bounded PCollection represents a dataset of a known, fixed size. Reading from a batch data source, such as a file or a database, creates a bounded PCollection. A batch processing job eventually ends, in contrast to a streaming job, which runs until cancelled.
+
+To learn more, see:
+
+* [Size and boundedness](/documentation/programming-guide/#size-and-boundedness)
+
+## Bounded data
+
+A dataset of a known, fixed size. A PCollection can be bounded or unbounded, depending on the source of the data that it represents. Reading from a batch data source, such as a file or a database, creates a bounded PCollection. Beam also supports reading a bounded amount of data from an unbounded source.
+
+To learn more, see:
+
+* [Size and boundedness](/documentation/programming-guide/#size-and-boundedness)
+
+## Bundle
+
+The processing unit for elements in a PCollection. Instead of processing all elements in a PCollection simultaneously, Beam processes the elements in bundles. The runner handles the division of the collection into bundles, and in doing so it may optimize the bundle size for the use case. For example, a streaming runner might process smaller bundles than a batch runner.
+
+To learn more, see:
+
+* [Bundling and persistence](/documentation/runtime/model/#bundling-and-persistence)
+
+## Coder
+
+A component that describes how the elements of a PCollection can be encoded and decoded. To support distributed processing and cross-language portability, Beam needs to be able to encode each element of a PCollection as bytes. The Beam SDKs provide built-in coders for common types and language-specific mechanisms for specifying the encoding of a PCollection.
+
+To learn more, see:
+
+* [Data encoding and type safety](/documentation/programming-guide/#data-encoding-and-type-safety)
+
+## CoGroupByKey
+
+A PTransform that takes two or more PCollections and aggregates the elements by key. In effect, CoGroupByKey performs a relational join of two or more key/value PCollections that have the same key type. While GroupByKey performs this operation over a single input collection, CoGroupByKey operates over multiple input collections.
+
+To learn more, see:
+
+* [CoGroupByKey](/documentation/programming-guide/#cogroupbykey)
+* [CoGroupByKey (Java)](/documentation/transforms/java/aggregation/cogroupbykey/)
+* [CoGroupByKey (Python)](/documentation/transforms/python/aggregation/cogroupbykey/)
+
+## Collection
+
+See [PCollection](/documentation/glossary/#pcollection).
+
+## Combine
+
+A PTransform for combining all elements of a PCollection or all values associated with a key. When you apply a Combine transform, you have to provide a user-defined function (UDF) that contains the logic for combining the elements or values. The combining function should be [commutative](https://en.wikipedia.org/wiki/Commutative_property) and [associative](https://en.wikipedia.org/wiki/Associative_property), because the function is not necessarily invoked exactly once on all values with a given key.
+
+To learn more, see:
+
+* [Combine](/documentation/programming-guide/#combine)
+* [Combine (Java)](/documentation/transforms/java/aggregation/combine/)
+* [CombineGlobally (Python)](/documentation/transforms/python/aggregation/combineglobally/)
+* [CombinePerKey (Python)](/documentation/transforms/python/aggregation/combineperkey/)
+* [CombineValues (Python)](/documentation/transforms/python/aggregation/combinevalues/)
+
+## Composite transform
+
+A PTransform that expands into many PTransforms. Composite transforms have a nested structure, in which a complex transform applies one or more simpler transforms.
These simpler transforms could be existing Beam operations like ParDo, Combine, or GroupByKey, or they could be other composite transforms. Nesting multiple transforms inside a single composite transform can make your pipeline more modular and easier to understand.
+
+To learn more, see:
+
+* [Composite transforms](/documentation/programming-guide/#composite-transforms)
+
+## Counter (metric)
+
+A metric that reports a single long value and can be incremented. In the Beam model, metrics provide insight into the state of a pipeline, potentially while the pipeline is running.
+
+To learn more, see:
+
+* [Types of metrics](/documentation/programming-guide/#types-of-metrics)
+
+## Cross-language transforms
+
+Transforms that can be shared across Beam SDKs. With cross-language transforms, you can use transforms written in any supported SDK language (currently, Java and Python) in a pipeline written in a different SDK language. For example, you could use the Apache Kafka connector from the Java SDK in a Python streaming pipeline. Cross-language transforms make it possible to provide new functionality simultaneously in different SDKs.
+
+To learn more, see:
+
+* [Multi-language pipelines](/documentation/programming-guide/#multi-language-pipelines)
+
+## Deferred execution
+
+A feature of the Beam execution model. Beam operations are deferred, meaning that the result of a given operation may not be available for control flow. Deferred execution allows the Beam API to support parallel processing of data.
+
+## Distribution (metric)
+
+A metric that reports information about the distribution of reported values. In the Beam model, metrics provide insight into the state of a pipeline, potentially while the pipeline is running.
+
+To learn more, see:
+
+* [Types of metrics](/documentation/programming-guide/#types-of-metrics)
+
+## DoFn
+
+A function object used by ParDo (or some other transform) to process the elements of a PCollection.
A DoFn is a user-defined function, meaning that it contains custom code that defines a data processing task in your pipeline. The Beam system invokes a DoFn one or more times to process some arbitrary bundle of elements, but Beam doesn’t guarantee an exact number of invocations.
+
+To learn more, see:
+
+* [ParDo](/documentation/programming-guide/#pardo)
+
+## Driver
+
+A program that defines your pipeline, including all of the inputs, transforms, and outputs. To use Beam, you need to create a driver program using classes from one of the Beam SDKs. The driver program creates a pipeline and specifies the execution options that tell the pipeline where and how to run. These options include the runner, which determines what backend your pipeline will run on.
+
+To learn more, see:
+
+* [Overview](/documentation/programming-guide/#overview)
+
+## Element
+
+The unit of data in a PCollection. Elements in a PCollection can be of any type, but they must all have the same type. This allows parallel computations to operate uniformly across the entire collection. Some element types have a structure that can be introspected (for example, JSON, Protocol Buffer, Avro, and database records).
+
+To learn more, see:
+
+* [PCollection characteristics](/documentation/programming-guide/#pcollection-characteristics)
+
+## Element-wise
+
+A type of transform that independently processes each element in an input PCollection. An element-wise transform might output 0, 1, or multiple values for each input element. This is in contrast to aggregation transforms, which compute a single value from multiple input elements. Element-wise operations include Filter, FlatMap, and ParDo.
+
+For a complete list of element-wise transforms, see:
+
+* [Java Transform catalog](/documentation/transforms/java/overview/)
+* [Python Transform catalog](/documentation/transforms/python/overview/)

Review comment:
   nit:
   ```suggestion
   * [Java Transform catalog](/documentation/transforms/java/overview/#element-wise)
   * [Python Transform catalog](/documentation/transforms/python/overview/#element-wise)
   ```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
