TheNeuralBit commented on a change in pull request #14717: URL: https://github.com/apache/beam/pull/14717#discussion_r625922517
########## File path: website/www/site/content/en/documentation/glossary.md ##########
@@ -0,0 +1,464 @@
+---
+title: "Beam glossary"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Apache Beam glossary
+
+## Aggregation
+
+A transform pattern for computing a value from multiple input elements. Aggregation is similar to the reduce operation in the [MapReduce](https://en.wikipedia.org/wiki/MapReduce) model. Aggregation transforms include Count (computes the count of all elements in the aggregation), Max (computes the maximum element in the aggregation), and Sum (computes the sum of all elements in the aggregation).
+
+For a complete list of aggregation transforms, see:
+
+* [Java Transform catalog](/documentation/transforms/java/overview/)
+* [Python Transform catalog](/documentation/transforms/python/overview/)
+
+## Apply
+
+A method for invoking a transform on a PCollection. Each transform in the Beam SDKs has a generic `apply` method (or pipe operator `|`). Invoking multiple Beam transforms is similar to method chaining, but with a difference: You apply the transform to the input PCollection, passing the transform itself as an argument, and the operation returns the output PCollection. Because of Beam’s deferred execution model, applying a transform does not immediately execute that transform.
+
+To learn more, see:
+
+* [Applying transforms](/documentation/programming-guide/#applying-transforms)
+
+## Batch processing
+
+A data processing paradigm for working with finite, or bounded, datasets. A bounded PCollection represents a dataset of a known, fixed size. Reading from a batch data source, such as a file or a database, creates a bounded PCollection. A batch processing job eventually ends, in contrast to a streaming job, which runs until cancelled.
+
+To learn more, see:
+
+* [Size and boundedness](/documentation/programming-guide/#size-and-boundedness)
+
+## Bounded data
+
+A dataset of a known, fixed size. A PCollection can be bounded or unbounded, depending on the source of the data that it represents. Reading from a batch data source, such as a file or a database, creates a bounded PCollection. Beam also supports reading a bounded amount of data from an unbounded source.
+
+To learn more, see:
+
+* [Size and boundedness](/documentation/programming-guide/#size-and-boundedness)
+
+## Bundle
+
+The processing unit for elements in a PCollection. Instead of processing all elements in a PCollection simultaneously, Beam processes the elements in bundles. The runner handles the division of the collection into bundles, and in doing so it may optimize the bundle size for the use case. For example, a streaming runner might process smaller bundles than a batch runner.
+
+To learn more, see:
+
+* [Bundling and persistence](/documentation/runtime/model/#bundling-and-persistence)
+
+## Coder
+
+A component that describes how the elements of a PCollection can be encoded and decoded. To support distributed processing and cross-language portability, Beam needs to be able to encode each element of a PCollection as bytes. The Beam SDKs provide built-in coders for common types and language-specific mechanisms for specifying the encoding of a PCollection.
+
+To learn more, see:
+
+* [Data encoding and type safety](/documentation/programming-guide/#data-encoding-and-type-safety)
+
+## CoGroupByKey
+
+A PTransform that takes two or more PCollections and aggregates the elements by key. In effect, CoGroupByKey performs a relational join of two or more key/value PCollections that have the same key type. While GroupByKey performs this operation over a single input collection, CoGroupByKey operates over multiple input collections.
+
+To learn more, see:
+
+* [CoGroupByKey](/documentation/programming-guide/#cogroupbykey)
+* [CoGroupByKey (Java)](/documentation/transforms/java/aggregation/cogroupbykey/)
+* [CoGroupByKey (Python)](/documentation/transforms/python/aggregation/cogroupbykey/)
+
+## Collection
+
+See [PCollection](/documentation/glossary/#pcollection).
+
+## Combine
+
+A PTransform for combining all elements of a PCollection or all values associated with a key. When you apply a Combine transform, you have to provide a user-defined function (UDF) that contains the logic for combining the elements or values. The combining function should be [commutative](https://en.wikipedia.org/wiki/Commutative_property) and [associative](https://en.wikipedia.org/wiki/Associative_property), because the function is not necessarily invoked exactly once on all values with a given key.
+
+To learn more, see:
+
+* [Combine](/documentation/programming-guide/#combine)
+* [Combine (Java)](/documentation/transforms/java/aggregation/combine/)
+* [CombineGlobally (Python)](/documentation/transforms/python/aggregation/combineglobally/)
+* [CombinePerKey (Python)](/documentation/transforms/python/aggregation/combineperkey/)
+* [CombineValues (Python)](/documentation/transforms/python/aggregation/combinevalues/)
+
+## Composite transform
+
+A PTransform that expands into many PTransforms. Composite transforms have a nested structure, in which a complex transform applies one or more simpler transforms.
These simpler transforms could be existing Beam operations like ParDo, Combine, or GroupByKey, or they could be other composite transforms. Nesting multiple transforms inside a single composite transform can make your pipeline more modular and easier to understand.
+
+To learn more, see:
+
+* [Composite transforms](/documentation/programming-guide/#composite-transforms)
+
+## Counter (metric)
+
+A metric that reports a single long value and can be incremented. In the Beam model, metrics provide insight into the state of a pipeline, potentially while the pipeline is running.
+
+To learn more, see:
+
+* [Types of metrics](/documentation/programming-guide/#types-of-metrics)
+
+## Cross-language transforms
+
+Transforms that can be shared across Beam SDKs. With cross-language transforms, you can use transforms written in any supported SDK language (currently, Java and Python) in a pipeline written in a different SDK language. For example, you could use the Apache Kafka connector from the Java SDK in a Python streaming pipeline. Cross-language transforms make it possible to provide new functionality simultaneously in different SDKs.
+
+To learn more, see:
+
+* [Multi-language pipelines](/documentation/programming-guide/#multi-language-pipelines)
+
+## Deferred execution
+
+A feature of the Beam execution model. Beam operations are deferred, meaning that the result of a given operation may not be available for control flow. Deferred execution allows the Beam API to support parallel processing of data.
+
+## Distribution (metric)
+
+A metric that reports information about the distribution of reported values. In the Beam model, metrics provide insight into the state of a pipeline, potentially while the pipeline is running.
+
+To learn more, see:
+
+* [Types of metrics](/documentation/programming-guide/#types-of-metrics)
+
+## DoFn
+
+A function object used by ParDo (or some other transform) to process the elements of a PCollection.
A DoFn is a user-defined function, meaning that it contains custom code that defines a data processing task in your pipeline. The Beam system invokes a DoFn one or more times to process some arbitrary bundle of elements, but Beam doesn’t guarantee an exact number of invocations.
+
+To learn more, see:
+
+* [ParDo](/documentation/programming-guide/#pardo)
+
+## Driver
+
+A program that defines your pipeline, including all of the inputs, transforms, and outputs. To use Beam, you need to create a driver program using classes from one of the Beam SDKs. The driver program creates a pipeline and specifies the execution options that tell the pipeline where and how to run. These options include the runner, which determines what backend your pipeline will run on.
+
+To learn more, see:
+
+* [Overview](/documentation/programming-guide/#overview)
+
+## Element
+
+The unit of data in a PCollection. Elements in a PCollection can be of any type, but they must all have the same type. This allows parallel computations to operate uniformly across the entire collection. Some element types have a structure that can be introspected (for example, JSON, Protocol Buffer, Avro, and database records).
+
+To learn more, see:
+
+* [PCollection characteristics](/documentation/programming-guide/#pcollection-characteristics)
+
+## Element-wise
+
+A type of transform that independently processes each element in an input PCollection. An element-wise transform might output 0, 1, or multiple values for each input element. This is in contrast to aggregation transforms, which compute a single value from multiple input elements. Element-wise operations include Filter, FlatMap, and ParDo.
+
+For a complete list of element-wise transforms, see:
+
+* [Java Transform catalog](/documentation/transforms/java/overview/)
+* [Python Transform catalog](/documentation/transforms/python/overview/)

Review comment:
   nit:
   ```suggestion
   * [Java Transform catalog](/documentation/transforms/java/overview/#element-wise)
   * [Python Transform catalog](/documentation/transforms/python/overview/#element-wise)
   ```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
