knaufk commented on a change in pull request #11092: [FLINK-15999] Extract “Concepts” material from API/Library sections and start proper concepts section
URL: https://github.com/apache/flink/pull/11092#discussion_r379550966
##########
File path: docs/concepts/stream-processing.md
##########
@@ -0,0 +1,96 @@
+---
+title: Stream Processing
+nav-id: stream-processing
+nav-pos: 1
+nav-title: Stream Processing
+nav-parent_id: concepts
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+`TODO: Add introduction`
+* This will be replaced by the TOC
+{:toc}
+
+## A Unified System for Batch & Stream Processing
+
+`TODO`
+
+{% top %}
+
+## Programs and Dataflows
+
+The basic building blocks of Flink programs are **streams** and
+**transformations**. Conceptually a *stream* is a (potentially never-ending)
+flow of data records, and a *transformation* is an operation that takes one or
+more streams as input, and produces one or more output streams as a result.
+
+When executed, Flink programs are mapped to **streaming dataflows**, consisting
+of **streams** and transformation **operators**. Each dataflow starts with one
+or more **sources** and ends in one or more **sinks**. The dataflows resemble
+arbitrary **directed acyclic graphs** *(DAGs)*. Although special forms of
+cycles are permitted via *iteration* constructs, for the most part we will
+gloss over this for simplicity.
+
+<img src="{{ site.baseurl }}/fig/program_dataflow.svg" alt="A DataStream program, and its dataflow." class="offset" width="80%" />
+
+Often there is a one-to-one correspondence between the transformations in the
+programs and the operators in the dataflow. Sometimes, however, one
+transformation may consist of multiple transformation operators.
+
+{% top %}
+
+## Parallel Dataflows
+
+Programs in Flink are inherently parallel and distributed. During execution, a
+*stream* has one or more **stream partitions**, and each *operator* has one or
+more **operator subtasks**. The operator subtasks are independent of one
+another, and execute in different threads and possibly on different machines or
+containers.
+
+The number of operator subtasks is the **parallelism** of that particular
+operator. The parallelism of a stream is always that of its producing operator.
+Different operators of the same program may have different levels of
+parallelism.
+
+<img src="{{ site.baseurl }}/fig/parallel_dataflow.svg" alt="A parallel dataflow" class="offset" width="80%" />
+
+Streams can transport data between two operators in a *one-to-one* (or

Review comment:
I think the different redistribution patterns that Fabian describes in his book are more to the point. I think they were:

* Forward
* Broadcast
* Random
* Keyed

IMHO the additional classification into "Redistributing" and "One-to-one" does not help.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
