NicoK commented on a change in pull request #11826:
URL: https://github.com/apache/flink/pull/11826#discussion_r411619829
##########
File path: docs/concepts/index.md
##########
@@ -27,20 +27,33 @@ specific language governing permissions and limitations
under the License.
-->
+The [Hands-on Tutorials]({{ site.baseurl }}{% link tutorials/index.md %}) explain the basic concepts
+of stateful and timely stream processing that underlie Flink's APIs, and provide examples of how
+these mechanisms are used in applications. Stateful stream processing is introduced in the context
+of [Data Pipelines & ETL]({{ site.baseurl }}{% link tutorials/etl.md %}#stateful-transformations)
+and is further developed in the section on [Fault Tolerance]({{ site.baseurl }}{% link
+tutorials/fault_tolerance.md %}). Timely stream processing is introduced in the section on
+[Streaming Analytics]({{ site.baseurl }}{% link tutorials/streaming_analytics.md %}).
+
+This _Concepts in Depth_ section provides a deeper understanding of how Flink's architecture and runtime
+implement these concepts.
+
+## Flink's APIs
+
Flink offers different levels of abstraction for developing streaming/batch applications.
<img src="{{ site.baseurl }}/fig/levels_of_abstraction.svg" alt="Programming levels of abstraction" class="offset" width="80%" />
- - The lowest level abstraction simply offers **stateful streaming**. It is
+ - The lowest level abstraction simply offers **stateful and timely stream processing**. It is
embedded into the [DataStream API]({{ site.baseurl}}{% link
dev/datastream_api.md %}) via the [Process Function]({{ site.baseurl }}{%
- link dev/stream/operators/process_function.md %}). It allows users freely
- process events from one or more streams, and use consistent fault tolerant
+ link dev/stream/operators/process_function.md %}). It allows users to freely
+ process events from one or more streams, and provides consistent, fault tolerant
*state*. In addition, users can register event time and processing time
callbacks, allowing programs to realize sophisticated computations.
- - In practice, most applications would not need the above described low level
- abstraction, but would instead program against the **Core APIs** like the
+ - In practice, many applications do not need the low level
Review comment:
Is it rather "low-level" as an adjective? (this occurs multiple times on this page)
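The "stateful and timely stream processing" abstraction discussed in this hunk can be illustrated outside of Flink. Below is a toy Python sketch of the idea, not Flink's actual ProcessFunction API: a per-key counter plus an event-time timer whose callback fires once the watermark passes it (all class and method names here are invented for illustration):

```python
# Toy model of "stateful and timely" per-event processing: each key keeps a
# running count, and a registered timer fires once the watermark passes its
# deadline. Illustration only -- not Flink's ProcessFunction API.
class ToyProcessFunction:
    def __init__(self):
        self.counts = {}   # per-key state
        self.timers = []   # (fire_time, key) pairs

    def process_element(self, key, timestamp):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.timers.append((timestamp + 10, key))  # register an event-time timer

    def advance_watermark(self, watermark, out):
        # fire every timer whose deadline is at or before the watermark
        due = [(t, k) for (t, k) in self.timers if t <= watermark]
        self.timers = [(t, k) for (t, k) in self.timers if t > watermark]
        for _, key in sorted(due):
            out.append((key, self.counts[key]))

fn = ToyProcessFunction()
results = []
fn.process_element("a", 1)
fn.process_element("a", 3)
fn.process_element("b", 5)
fn.advance_watermark(12, results)  # timers at 11, 13, 15 -> only 11 fires
```

The point of the sketch is the combination the doc text describes: free-form per-event processing, consistent keyed *state*, and time-based callbacks in one operator.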
##########
File path: docs/concepts/index.md
##########
@@ -27,20 +27,33 @@ specific language governing permissions and limitations
under the License.
-->
+The [Hands-on Tutorials]({{ site.baseurl }}{% link tutorials/index.md %}) explain the basic concepts
+of stateful and timely stream processing that underlie Flink's APIs, and provide examples of how
+these mechanisms are used in applications. Stateful stream processing is introduced in the context
+of [Data Pipelines & ETL]({{ site.baseurl }}{% link tutorials/etl.md %}#stateful-transformations)
+and is further developed in the section on [Fault Tolerance]({{ site.baseurl }}{% link
+tutorials/fault_tolerance.md %}). Timely stream processing is introduced in the section on
+[Streaming Analytics]({{ site.baseurl }}{% link tutorials/streaming_analytics.md %}).
+
+This _Concepts in Depth_ section provides a deeper understanding of how Flink's architecture and runtime
+implement these concepts.
+
+## Flink's APIs
+
Flink offers different levels of abstraction for developing streaming/batch applications.
<img src="{{ site.baseurl }}/fig/levels_of_abstraction.svg" alt="Programming levels of abstraction" class="offset" width="80%" />
- - The lowest level abstraction simply offers **stateful streaming**. It is
+ - The lowest level abstraction simply offers **stateful and timely stream processing**. It is
embedded into the [DataStream API]({{ site.baseurl}}{% link
dev/datastream_api.md %}) via the [Process Function]({{ site.baseurl }}{%
- link dev/stream/operators/process_function.md %}). It allows users freely
- process events from one or more streams, and use consistent fault tolerant
+ link dev/stream/operators/process_function.md %}). It allows users to freely
+ process events from one or more streams, and provides consistent, fault tolerant
*state*. In addition, users can register event time and processing time
callbacks, allowing programs to realize sophisticated computations.
- - In practice, most applications would not need the above described low level
- abstraction, but would instead program against the **Core APIs** like the
+ - In practice, many applications do not need the low level
Review comment:
```suggestion
- In practice, many applications do not need the low-level
```
##########
File path: docs/concepts/index.md
##########
@@ -50,8 +63,8 @@ Flink offers different levels of abstraction for developing streaming/batch appl
respective programming languages.
The low level *Process Function* integrates with the *DataStream API*,
- making it possible to go the lower level abstraction for certain operations
- only. The *DataSet API* offers additional primitives on bounded data sets,
+ making it possible to use the lower level abstraction on an as-needed basis.
Review comment:
?
```suggestion
making it possible to use the lower-level abstraction on an as-needed basis.
```
##########
File path: docs/tutorials/index.md
##########
@@ -0,0 +1,186 @@
+---
+title: Hands-on Tutorials
+nav-id: tutorials
+nav-pos: 2
+nav-title: '<i class="fa fa-hand-paper-o title appetizer" aria-hidden="true"></i> Hands-on Tutorials'
+nav-parent_id: root
+nav-show_overview: true
+always-expand: true
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Goals and Scope of these Tutorials
+
+These tutorials present an introduction to Apache Flink that includes just enough to get you started
+writing scalable streaming ETL, analytics, and event-driven applications, while leaving out a lot of
+(ultimately important) details. The focus is on providing straightforward introductions to Flink's
+APIs for managing state and time, with the expectation that having mastered these fundamentals,
+you'll be much better equipped to pick up the rest of what you need to know from the more detailed
+reference documentation. The links at the end of each page will lead you to where you can learn
+more.
+
+Specifically, you will learn:
+
+- how to implement streaming data processing pipelines
+- how and why Flink manages state
+- how to use event time to consistently compute accurate analytics
+- how to build event-driven applications on continuous streams
+- how Flink is able to provide fault-tolerant, stateful stream processing with exactly-once semantics
+
+These tutorials focus on four critical concepts: continuous processing of streaming data, event
+time, stateful stream processing, and state snapshots. This page introduces these concepts.
+
+{% info Note %} Accompanying these tutorials are a set of hands-on exercises that will guide you
+through learning how to work with the concepts being presented.
Review comment:
add link to http://github.com/apache/flink-training ?
##########
File path: docs/tutorials/index.md
##########
@@ -0,0 +1,186 @@
+---
+title: Hands-on Tutorials
+nav-id: tutorials
+nav-pos: 2
+nav-title: '<i class="fa fa-hand-paper-o title appetizer" aria-hidden="true"></i> Hands-on Tutorials'
+nav-parent_id: root
+nav-show_overview: true
+always-expand: true
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Goals and Scope of these Tutorials
+
+These tutorials present an introduction to Apache Flink that includes just enough to get you started
+writing scalable streaming ETL, analytics, and event-driven applications, while leaving out a lot of
+(ultimately important) details. The focus is on providing straightforward introductions to Flink's
+APIs for managing state and time, with the expectation that having mastered these fundamentals,
+you'll be much better equipped to pick up the rest of what you need to know from the more detailed
+reference documentation. The links at the end of each page will lead you to where you can learn
+more.
+
+Specifically, you will learn:
+
+- how to implement streaming data processing pipelines
+- how and why Flink manages state
+- how to use event time to consistently compute accurate analytics
+- how to build event-driven applications on continuous streams
+- how Flink is able to provide fault-tolerant, stateful stream processing with exactly-once semantics
+
+These tutorials focus on four critical concepts: continuous processing of streaming data, event
+time, stateful stream processing, and state snapshots. This page introduces these concepts.
+
+{% info Note %} Accompanying these tutorials are a set of hands-on exercises that will guide you
+through learning how to work with the concepts being presented.
+
+{% top %}
+
+## Stream Processing
+
+Streams are data's natural habitat. Whether it's events from web servers, trades from a stock
+exchange, or sensor readings from a machine on a factory floor, data is created as part of a stream.
+But when you analyze data, you can either organize your processing around _bounded_ or _unbounded_
+streams, and which of these paradigms you choose has profound consequences.
+
+<img src="{{ site.baseurl }}/fig/bounded-unbounded.png" alt="Bounded and unbounded streams" class="offset" width="90%" />
+
+**Batch processing** is the paradigm at work when you process a bounded data stream. In this mode of
+operation you can choose to ingest the entire dataset before producing any results, which means that
+it's possible, for example, to sort the data, compute global statistics, or produce a final report
+that summarizes all of the input.
+
+**Stream processing**, on the other hand, involves unbounded data streams. Conceptually, at least,
+the input may never end, and so you are forced to continuously process the data as it arrives.
+
+In Flink, applications are composed of **streaming dataflows** that may be transformed by
+user-defined **operators**. These dataflows form directed graphs that start with one or more
+**sources**, and end in one or more **sinks**.
Review comment:
You could actually link to the glossary here, but I'm not sure whether it would confuse people if
they actually followed the links and read the information there (may be too much detail there);
however, it may be useful later.
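The source → operators → sink dataflow described in this hunk can be mimicked with plain Python generators. A hedged sketch of the shape of such a pipeline, not Flink code (all function names are invented):

```python
# A dataflow as a chain of generators: one source, two operators, one sink.
def source(events):
    # In Flink this node would be e.g. a Kafka or Kinesis consumer.
    for e in events:
        yield e

def map_op(stream, fn):
    # A user-defined transformation applied to every event.
    for e in stream:
        yield fn(e)

def filter_op(stream, pred):
    # Another operator node in the directed graph.
    for e in stream:
        if pred(e):
            yield e

def sink(stream):
    # Terminal node: collect results (a real sink would write them out).
    return list(stream)

pipeline = sink(filter_op(map_op(source(range(5)), lambda x: x * 2),
                          lambda x: x > 2))
```

Each generator is one node of the directed graph; chaining them reproduces the one-transformation-per-operator correspondence the doc mentions, while a fused `map`+`filter` function would illustrate one transformation consisting of multiple operators.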
##########
File path: docs/tutorials/index.md
##########
@@ -0,0 +1,186 @@
+---
+title: Hands-on Tutorials
+nav-id: tutorials
+nav-pos: 2
+nav-title: '<i class="fa fa-hand-paper-o title appetizer" aria-hidden="true"></i> Hands-on Tutorials'
+nav-parent_id: root
+nav-show_overview: true
+always-expand: true
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Goals and Scope of these Tutorials
+
+These tutorials present an introduction to Apache Flink that includes just enough to get you started
+writing scalable streaming ETL, analytics, and event-driven applications, while leaving out a lot of
+(ultimately important) details. The focus is on providing straightforward introductions to Flink's
+APIs for managing state and time, with the expectation that having mastered these fundamentals,
+you'll be much better equipped to pick up the rest of what you need to know from the more detailed
+reference documentation. The links at the end of each page will lead you to where you can learn
+more.
+
+Specifically, you will learn:
+
+- how to implement streaming data processing pipelines
+- how and why Flink manages state
+- how to use event time to consistently compute accurate analytics
+- how to build event-driven applications on continuous streams
+- how Flink is able to provide fault-tolerant, stateful stream processing with exactly-once semantics
+
+These tutorials focus on four critical concepts: continuous processing of streaming data, event
+time, stateful stream processing, and state snapshots. This page introduces these concepts.
+
+{% info Note %} Accompanying these tutorials are a set of hands-on exercises that will guide you
+through learning how to work with the concepts being presented.
+
+{% top %}
+
+## Stream Processing
+
+Streams are data's natural habitat. Whether it's events from web servers, trades from a stock
Review comment:
nit: I usually try to spell out things like "it is" to simplify things for non-native speakers...
```suggestion
Streams are data's natural habitat. Whether it is events from web servers, trades from a stock
```
##########
File path: docs/tutorials/index.md
##########
@@ -0,0 +1,186 @@
+---
+title: Hands-on Tutorials
+nav-id: tutorials
+nav-pos: 2
+nav-title: '<i class="fa fa-hand-paper-o title appetizer" aria-hidden="true"></i> Hands-on Tutorials'
+nav-parent_id: root
+nav-show_overview: true
+always-expand: true
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Goals and Scope of these Tutorials
+
+These tutorials present an introduction to Apache Flink that includes just enough to get you started
+writing scalable streaming ETL, analytics, and event-driven applications, while leaving out a lot of
+(ultimately important) details. The focus is on providing straightforward introductions to Flink's
+APIs for managing state and time, with the expectation that having mastered these fundamentals,
+you'll be much better equipped to pick up the rest of what you need to know from the more detailed
+reference documentation. The links at the end of each page will lead you to where you can learn
+more.
+
+Specifically, you will learn:
+
+- how to implement streaming data processing pipelines
+- how and why Flink manages state
+- how to use event time to consistently compute accurate analytics
+- how to build event-driven applications on continuous streams
+- how Flink is able to provide fault-tolerant, stateful stream processing with exactly-once semantics
+
+These tutorials focus on four critical concepts: continuous processing of streaming data, event
+time, stateful stream processing, and state snapshots. This page introduces these concepts.
+
+{% info Note %} Accompanying these tutorials are a set of hands-on exercises that will guide you
+through learning how to work with the concepts being presented.
+
+{% top %}
+
+## Stream Processing
+
+Streams are data's natural habitat. Whether it's events from web servers, trades from a stock
+exchange, or sensor readings from a machine on a factory floor, data is created as part of a stream.
+But when you analyze data, you can either organize your processing around _bounded_ or _unbounded_
+streams, and which of these paradigms you choose has profound consequences.
+
+<img src="{{ site.baseurl }}/fig/bounded-unbounded.png" alt="Bounded and unbounded streams" class="offset" width="90%" />
+
+**Batch processing** is the paradigm at work when you process a bounded data stream. In this mode of
+operation you can choose to ingest the entire dataset before producing any results, which means that
+it's possible, for example, to sort the data, compute global statistics, or produce a final report
+that summarizes all of the input.
+
+**Stream processing**, on the other hand, involves unbounded data streams. Conceptually, at least,
+the input may never end, and so you are forced to continuously process the data as it arrives.
+
+In Flink, applications are composed of **streaming dataflows** that may be transformed by
+user-defined **operators**. These dataflows form directed graphs that start with one or more
+**sources**, and end in one or more **sinks**.
+
+<img src="{{ site.baseurl }}/fig/program_dataflow.svg" alt="A DataStream program, and its dataflow." class="offset" width="80%" />
+
+Often there is a one-to-one correspondence between the transformations in the programs and the
+operators in the dataflow. Sometimes, however, one transformation may consist of multiple operators.
+
+An application may consume real-time data from streaming sources such as message queues or
+distributed logs, such as Apache Kafka or Kinesis. But flink can also consume bounded, historic data
Review comment:
don't repeat "such as" again? Maybe replace with this?
```suggestion
distributed logs, like Apache Kafka or Kinesis. But flink can also consume
bounded, historic data
```
##########
File path: docs/tutorials/index.md
##########
@@ -0,0 +1,186 @@
+---
+title: Hands-on Tutorials
+nav-id: tutorials
+nav-pos: 2
+nav-title: '<i class="fa fa-hand-paper-o title appetizer" aria-hidden="true"></i> Hands-on Tutorials'
+nav-parent_id: root
+nav-show_overview: true
+always-expand: true
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Goals and Scope of these Tutorials
+
+These tutorials present an introduction to Apache Flink that includes just enough to get you started
+writing scalable streaming ETL, analytics, and event-driven applications, while leaving out a lot of
+(ultimately important) details. The focus is on providing straightforward introductions to Flink's
+APIs for managing state and time, with the expectation that having mastered these fundamentals,
+you'll be much better equipped to pick up the rest of what you need to know from the more detailed
+reference documentation. The links at the end of each page will lead you to where you can learn
+more.
+
+Specifically, you will learn:
+
+- how to implement streaming data processing pipelines
+- how and why Flink manages state
+- how to use event time to consistently compute accurate analytics
+- how to build event-driven applications on continuous streams
+- how Flink is able to provide fault-tolerant, stateful stream processing with exactly-once semantics
+
+These tutorials focus on four critical concepts: continuous processing of streaming data, event
+time, stateful stream processing, and state snapshots. This page introduces these concepts.
+
+{% info Note %} Accompanying these tutorials are a set of hands-on exercises that will guide you
+through learning how to work with the concepts being presented.
+
+{% top %}
+
+## Stream Processing
+
+Streams are data's natural habitat. Whether it's events from web servers, trades from a stock
+exchange, or sensor readings from a machine on a factory floor, data is created as part of a stream.
+But when you analyze data, you can either organize your processing around _bounded_ or _unbounded_
+streams, and which of these paradigms you choose has profound consequences.
+
+<img src="{{ site.baseurl }}/fig/bounded-unbounded.png" alt="Bounded and unbounded streams" class="offset" width="90%" />
+
+**Batch processing** is the paradigm at work when you process a bounded data stream. In this mode of
+operation you can choose to ingest the entire dataset before producing any results, which means that
+it's possible, for example, to sort the data, compute global statistics, or produce a final report
+that summarizes all of the input.
+
+**Stream processing**, on the other hand, involves unbounded data streams. Conceptually, at least,
+the input may never end, and so you are forced to continuously process the data as it arrives.
+
+In Flink, applications are composed of **streaming dataflows** that may be transformed by
+user-defined **operators**. These dataflows form directed graphs that start with one or more
+**sources**, and end in one or more **sinks**.
+
+<img src="{{ site.baseurl }}/fig/program_dataflow.svg" alt="A DataStream program, and its dataflow." class="offset" width="80%" />
+
+Often there is a one-to-one correspondence between the transformations in the programs and the
+operators in the dataflow. Sometimes, however, one transformation may consist of multiple operators.
+
+An application may consume real-time data from streaming sources such as message queues or
+distributed logs, such as Apache Kafka or Kinesis. But flink can also consume bounded, historic data
+from a variety of data sources. Similarly, the streams of results being produced by a Flink
+application can be sent to a wide variety of systems, and the state held within Flink can be
+accessed via a REST API.
Review comment:
Accessing state within Flink via a REST API? That's not really available, is it?
##########
File path: docs/tutorials/index.md
##########
@@ -0,0 +1,186 @@
+---
+title: Hands-on Tutorials
+nav-id: tutorials
+nav-pos: 2
+nav-title: '<i class="fa fa-hand-paper-o title appetizer" aria-hidden="true"></i> Hands-on Tutorials'
+nav-parent_id: root
+nav-show_overview: true
+always-expand: true
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Goals and Scope of these Tutorials
+
+These tutorials present an introduction to Apache Flink that includes just enough to get you started
+writing scalable streaming ETL, analytics, and event-driven applications, while leaving out a lot of
+(ultimately important) details. The focus is on providing straightforward introductions to Flink's
+APIs for managing state and time, with the expectation that having mastered these fundamentals,
+you'll be much better equipped to pick up the rest of what you need to know from the more detailed
+reference documentation. The links at the end of each page will lead you to where you can learn
+more.
+
+Specifically, you will learn:
+
+- how to implement streaming data processing pipelines
+- how and why Flink manages state
+- how to use event time to consistently compute accurate analytics
+- how to build event-driven applications on continuous streams
+- how Flink is able to provide fault-tolerant, stateful stream processing with exactly-once semantics
+
+These tutorials focus on four critical concepts: continuous processing of streaming data, event
+time, stateful stream processing, and state snapshots. This page introduces these concepts.
+
+{% info Note %} Accompanying these tutorials are a set of hands-on exercises that will guide you
+through learning how to work with the concepts being presented.
Review comment:
or mention that links to the according exercises will be available where needed?
##########
File path: docs/tutorials/index.md
##########
@@ -0,0 +1,186 @@
+---
+title: Hands-on Tutorials
+nav-id: tutorials
+nav-pos: 2
+nav-title: '<i class="fa fa-hand-paper-o title appetizer" aria-hidden="true"></i> Hands-on Tutorials'
+nav-parent_id: root
+nav-show_overview: true
+always-expand: true
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Goals and Scope of these Tutorials
+
+These tutorials present an introduction to Apache Flink that includes just enough to get you started
+writing scalable streaming ETL, analytics, and event-driven applications, while leaving out a lot of
+(ultimately important) details. The focus is on providing straightforward introductions to Flink's
+APIs for managing state and time, with the expectation that having mastered these fundamentals,
+you'll be much better equipped to pick up the rest of what you need to know from the more detailed
+reference documentation. The links at the end of each page will lead you to where you can learn
+more.
+
+Specifically, you will learn:
+
+- how to implement streaming data processing pipelines
+- how and why Flink manages state
+- how to use event time to consistently compute accurate analytics
+- how to build event-driven applications on continuous streams
+- how Flink is able to provide fault-tolerant, stateful stream processing with exactly-once semantics
+
+These tutorials focus on four critical concepts: continuous processing of streaming data, event
+time, stateful stream processing, and state snapshots. This page introduces these concepts.
+
+{% info Note %} Accompanying these tutorials are a set of hands-on exercises that will guide you
+through learning how to work with the concepts being presented.
+
+{% top %}
+
+## Stream Processing
+
+Streams are data's natural habitat. Whether it's events from web servers, trades from a stock
+exchange, or sensor readings from a machine on a factory floor, data is created as part of a stream.
+But when you analyze data, you can either organize your processing around _bounded_ or _unbounded_
+streams, and which of these paradigms you choose has profound consequences.
+
+<img src="{{ site.baseurl }}/fig/bounded-unbounded.png" alt="Bounded and unbounded streams" class="offset" width="90%" />
+
+**Batch processing** is the paradigm at work when you process a bounded data stream. In this mode of
+operation you can choose to ingest the entire dataset before producing any results, which means that
+it's possible, for example, to sort the data, compute global statistics, or produce a final report
Review comment:
```suggestion
it is possible, for example, to sort the data, compute global statistics, or produce a final report
```
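The batch-versus-stream distinction quoted in this hunk (ingest everything before producing results, versus emitting results continuously) can be sketched in a few lines of plain Python. This is a conceptual illustration only, with invented function names, not Flink's APIs:

```python
# Batch: the bounded dataset is fully ingested before any result is produced,
# which is what makes global operations like sorting possible.
def batch_sort(bounded):
    data = list(bounded)          # read the entire input up front
    return sorted(data)

# Streaming: the input is conceptually unbounded, so a result must be emitted
# as each event arrives -- here, a running maximum per incoming event.
def streaming_max(unbounded):
    current = None
    for x in unbounded:
        current = x if current is None else max(current, x)
        yield current             # an updated result per event

sorted_report = batch_sort([3, 1, 2])
running = list(streaming_max(iter([3, 1, 5])))
```

Note that `batch_sort` cannot emit anything until the input ends, while `streaming_max` never needs the input to end at all; that is the profound consequence of the paradigm choice the doc describes.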
##########
File path: docs/tutorials/index.md
##########
@@ -0,0 +1,186 @@
+---
+title: Hands-on Tutorials
+nav-id: tutorials
+nav-pos: 2
+nav-title: '<i class="fa fa-hand-paper-o title appetizer" aria-hidden="true"></i> Hands-on Tutorials'
+nav-parent_id: root
+nav-show_overview: true
+always-expand: true
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Goals and Scope of these Tutorials
+
+These tutorials present an introduction to Apache Flink that includes just
enough to get you started
+writing scalable streaming ETL, analytics, and event-driven applications,
while leaving out a lot of
+(ultimately important) details. The focus is on providing straightforward
introductions to Flink's
+APIs for managing state and time, with the expectation that having mastered
these fundamentals,
+you'll be much better equipped to pick up the rest of what you need to know
from the more detailed
+reference documentation. The links at the end of each page will lead you to
where you can learn
+more.
+
+Specifically, you will learn:
+
+- how to implement streaming data processing pipelines
+- how and why Flink manages state
+- how to use event time to consistently compute accurate analytics
+- how to build event-driven applications on continuous streams
+- how Flink is able to provide fault-tolerant, stateful stream processing with
exactly-once semantics
+
+These tutorials focus on four critical concepts: continuous processing of
streaming data, event
+time, stateful stream processing, and state snapshots. This page introduces
these concepts.
+
+{% info Note %} Accompanying these tutorials is a set of hands-on exercises
that will guide you
+through working with the concepts being presented.
+
+{% top %}
+
+## Stream Processing
+
+Streams are data's natural habitat. Whether it's events from web servers,
trades from a stock
+exchange, or sensor readings from a machine on a factory floor, data is
created as part of a stream.
+But when you analyze data, you can either organize your processing around
_bounded_ or _unbounded_
+streams, and which of these paradigms you choose has profound consequences.
+
+<img src="{{ site.baseurl }}/fig/bounded-unbounded.png" alt="Bounded and
unbounded streams" class="offset" width="90%" />
+
+**Batch processing** is the paradigm at work when you process a bounded data
stream. In this mode of
+operation you can choose to ingest the entire dataset before producing any
results, which means that
+it's possible, for example, to sort the data, compute global statistics, or
produce a final report
+that summarizes all of the input.
+
+**Stream processing**, on the other hand, involves unbounded data streams.
Conceptually, at least,
+the input may never end, and so you are forced to continuously process the
data as it arrives.
+
+In Flink, applications are composed of **streaming dataflows** that may be
transformed by
+user-defined **operators**. These dataflows form directed graphs that start
with one or more
+**sources**, and end in one or more **sinks**.
+
+<img src="{{ site.baseurl }}/fig/program_dataflow.svg" alt="A DataStream
program, and its dataflow." class="offset" width="80%" />
+
+Often there is a one-to-one correspondence between the transformations in the
programs and the
Review comment:
```suggestion
Often there is a one-to-one correspondence between the transformations in
the program and the
```
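The hunk above describes Flink applications as streaming dataflows: directed graphs of sources, user-defined operators, and sinks. As a minimal, framework-free sketch of that shape (Python generators standing in for streams; `source`, `map_operator`, and `sink` are illustrative names, not Flink API):

```python
# Sketch of a dataflow graph: source -> operator -> sink.
# Generators model streams that are processed one event at a time.

def source(events):
    """Source: emits events one at a time, as a stream would arrive."""
    for event in events:
        yield event

def map_operator(stream, fn):
    """A simple one-input, one-output user-defined transformation."""
    for event in stream:
        yield fn(event)

def sink(stream):
    """Sink: collects results (a real sink would write to an external system)."""
    return list(stream)

# Wire source -> map -> sink into a small directed dataflow.
results = sink(map_operator(source([1, 2, 3]), lambda x: x * 10))
```

Because each stage is a generator, nothing is materialized up front; events flow through the graph as they are produced, which is the essential difference from batch-style "read everything, then process".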
##########
File path: docs/tutorials/index.md
##########
@@ -0,0 +1,186 @@
+Often there is a one-to-one correspondence between the transformations in the
programs and the
+operators in the dataflow. Sometimes, however, one transformation may consist
of multiple operators.
+
+An application may consume real-time data from streaming sources such as
message queues or
+distributed logs (e.g., Apache Kafka or Kinesis). But Flink can also consume
bounded, historical data
+from a variety of data sources. Similarly, the streams of results being
produced by a Flink
+application can be sent to a wide variety of systems, and the state held
within Flink can be
+accessed via a REST API.
+
+<img src="{{ site.baseurl }}/fig/flink-application-sources-sinks.png"
alt="Flink application with sources and sinks" class="offset" width="90%" />
+
+### Parallel Dataflows
+
+Programs in Flink are inherently parallel and distributed. During execution, a
+*stream* has one or more **stream partitions**, and each *operator* has one or
+more **operator subtasks**. The operator subtasks are independent of one
+another, and execute in different threads and possibly on different machines or
+containers.
+
+The number of operator subtasks is the **parallelism** of that particular
+operator. The parallelism of a stream is always that of its producing operator.
+Different operators of the same program may have different levels of
+parallelism.
+
+<img src="{{ site.baseurl }}/fig/parallel_dataflow.svg" alt="A parallel
dataflow" class="offset" width="80%" />
Review comment:
FYI: task vs. subtask should also be changed in the image if changed in
text (as proposed above)
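The keyed redistribution described in this hunk can be sketched without Flink: hash the key, and let the hash pick the target operator subtask, so that all events with the same key land on the same parallel instance. This mirrors the idea behind `keyBy()`, not Flink's actual key-group mechanics; the function and variable names here are illustrative.

```python
import hashlib

def subtask_for_key(key, parallelism):
    """Stable hash of the key, so the same key always picks the same subtask."""
    digest = hashlib.sha256(str(key).encode()).hexdigest()
    return int(digest, 16) % parallelism

# Route a small keyed stream to two parallel subtasks.
events = [("user-a", 1), ("user-b", 2), ("user-a", 3)]
parallelism = 2
routed = {}
for key, value in events:
    routed.setdefault(subtask_for_key(key, parallelism), []).append((key, value))
# All events for "user-a" end up in the same subtask's bucket.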
##########
File path: docs/tutorials/index.md
##########
@@ -0,0 +1,186 @@
+The number of operator subtasks is the **parallelism** of that particular
+operator. The parallelism of a stream is always that of its producing operator.
Review comment:
Do you actually need the concept of the "parallelism of a stream"?
Because since there could be n->m streams, I find it difficult to just use "n"
here and also I rarely use parallelism for the stream itself...
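The reviewer's n-to-m point can be illustrated with a `rebalance()`-style exchange, where producer and consumer parallelism need not match because records are simply dealt round-robin. This is a conceptual sketch, not Flink's network stack:

```python
from itertools import cycle

def rebalance(producer_outputs, m):
    """Merge n producer streams and deal their records round-robin to m consumers."""
    consumers = [[] for _ in range(m)]
    targets = cycle(range(m))
    for stream in producer_outputs:
        for record in stream:
            consumers[next(targets)].append(record)
    return consumers

# n = 2 producer subtasks feeding m = 3 consumer subtasks.
out = rebalance([[1, 2, 3], [4, 5, 6]], 3)
```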
##########
File path: docs/tutorials/index.md
##########
@@ -0,0 +1,186 @@
+Streams can transport data between two operators in a *one-to-one* (or
+*forwarding*) pattern, or in a *redistributing* pattern:
+
+ - **One-to-one** streams (for example between the *Source* and the *map()*
+ operators in the figure above) preserve the partitioning and ordering of
+ the elements. That means that subtask[1] of the *map()* operator will see
+ the same elements in the same order as they were produced by subtask[1] of
+ the *Source* operator.
+
+ - **Redistributing** streams (as between *map()* and *keyBy/window* above, as
+ well as between *keyBy/window* and *Sink*) change the partitioning of
+ streams. Each *operator subtask* sends data to different target subtasks,
+ depending on the selected transformation. Examples are *keyBy()* (which
+ re-partitions by hashing the key), *broadcast()*, or *rebalance()* (which
+ re-partitions randomly). In a *redistributing* exchange the ordering among
+ the elements is only preserved within each pair of sending and receiving
+ subtasks (for example, subtask[1] of *map()* and subtask[2] of
+ *keyBy/window*). So, for example, the redistribution between the
keyBy/window and
+ the Sink operators shown above introduces non-determinism regarding the
+ order in which the aggregated results for different keys arrive at the
Sink.
+
+{% top %}
+
+## Timely Stream Processing
+
+For most streaming applications it is very valuable to be able to re-process
historical data with the
+same code that is used to process live data -- and to produce deterministic,
consistent results,
+regardless.
+
+It can also be crucial to pay attention to the order in which events occurred,
rather than the order
+in which they are delivered for processing, and to be able to reason about
when a set of events is
+(or should be) complete. For example, consider the set of events involved in
an e-commerce
+transaction, or financial trade.
+
+These requirements for timely stream processing can be met by using event time
timestamps that are
+recorded in the data stream, rather than using the clocks of the machines
processing the data.
+
+{% top %}
+
+## Stateful Stream Processing
+
+Flink's operations can be stateful. This means that how one event is handled
can depend on the
+accumulated effect of all the events that came before it. State may be used
for something simple,
+such as counting events per minute to display on a dashboard, or for something
more complex, such as
+computing features for a fraud detection model.
+
+A Flink application is run in parallel on a distributed cluster. The various
parallel instances of a
+given operator will execute independently, in separate threads, and in general
will be running on
+different machines.
+
+The set of parallel instances of a stateful operator is effectively a sharded
key-value store. Each
+parallel instance is responsible for handling events for a specific group of
keys, and the state for
+those keys is kept locally.
+
+The diagram below shows a job running with a parallelism of two across the
first three operators in
+the job graph, terminating in a sink that has a parallelism of one. The third
operator is stateful,
+and you can see that a fully connected network shuffle is occurring between
the second and third
+operators. This is being done to partition the stream by some key, so that all
of the events that
+need to be processed together will be.
+
+<img src="{{ site.baseurl }}/fig/parallel-job.png" alt="State is sharded"
class="offset" width="65%" />
+
+State is always accessed locally, which helps Flink applications achieve high
throughput and
+low-latency. You can choose to keep state on the JVM heap, or if it is too
large, in efficiently
+organized on-disk data structures.
Review comment:
```suggestion
low-latency. You can choose to keep state on the JVM heap, or if it is too
large, in
efficiently-organized on-disk data structures.
```
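The event-time idea in this hunk, grouping events by the timestamp recorded in the event rather than by the processing machine's clock, can be sketched as a simple windowed count. Assuming illustrative `(timestamp, key)` tuples and a fixed 60-second window (none of this is Flink API):

```python
from collections import defaultdict

WINDOW_SIZE = 60  # window length in seconds of event time

def window_counts(events):
    """Count events per 60-second event-time window, keyed by (window_start, key)."""
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = (timestamp // WINDOW_SIZE) * WINDOW_SIZE
        counts[(window_start, key)] += 1
    return dict(counts)

# Out-of-order delivery does not change the result, because grouping
# uses the embedded event timestamps, not the arrival order.
in_order = [(0, "a"), (30, "a"), (65, "a")]
shuffled = [(65, "a"), (0, "a"), (30, "a")]
```

This is why replaying historical data through the same code can produce the same results as live processing: the computation depends only on the timestamps carried in the stream.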
##########
File path: docs/tutorials/index.md
##########
@@ -0,0 +1,186 @@
+<img src="{{ site.baseurl }}/fig/local-state.png" alt="State is local"
class="offset" width="90%" />
+
+{% top %}
+
+## Fault Tolerance via State Snapshots
+
+Flink is able to provide fault-tolerant, exactly-once semantics through a
combination of state
+snapshots and stream replay. These snapshots capture the entire state of the
distributed pipeline,
+recording offsets into the input queues as well as the state throughout the
job graph that has
+resulted from having ingested the data up to that point. When a failure
occurs, the sources are
+rewound, the state is restored, and processing is resumed. As depicted above,
these state snapshots
+are captured asynchronously, without impeding the ongoing processing.
Review comment:
"Understatement of the year" ;)
But for the training, and as an introduction, it is fine to leave out the
details here.
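The snapshot-and-replay mechanism described in the quoted paragraph can be sketched as a toy model in plain Python (not Flink's actual checkpoint implementation): a snapshot records the source offset together with the operator state, and on failure the source is rewound to that offset before processing resumes.

```python
# Toy sketch of recovery via state snapshots plus stream replay
# (illustrative only, not Flink's actual checkpoint implementation).
# A snapshot records the source offset together with the operator
# state; on failure the source is rewound and processing resumes,
# so every event is reflected in the state exactly once.

events = [3, 1, 4, 1, 5, 9, 2, 6]

def run(snapshot_at, fail_at):
    offset, total = 0, 0      # source offset and operator state (a running sum)
    snapshot = (0, 0)
    try:
        while offset < len(events):
            if offset == snapshot_at:
                snapshot = (offset, total)   # capture offset + state together
            if offset == fail_at:
                raise RuntimeError("worker lost")
            total += events[offset]
            offset += 1
    except RuntimeError:
        offset, total = snapshot             # restore state, rewind the source
        while offset < len(events):
            total += events[offset]          # replay from the snapshot offset
            offset += 1
    return total

result = run(snapshot_at=4, fail_at=6)   # fails mid-stream, then recovers
```

Despite the failure, the final sum equals the sum over all events, with none lost and none double-counted.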
##########
File path: docs/tutorials/index.md
##########
@@ -0,0 +1,186 @@
+---
+title: Hands-on Tutorials
+nav-id: tutorials
+nav-pos: 2
+nav-title: '<i class="fa fa-hand-paper-o title appetizer"
aria-hidden="true"></i> Hands-on Tutorials'
+nav-parent_id: root
+nav-show_overview: true
+always-expand: true
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Goals and Scope of these Tutorials
+
+These tutorials present an introduction to Apache Flink that includes just
enough to get you started
+writing scalable streaming ETL, analytics, and event-driven applications,
while leaving out a lot of
+(ultimately important) details. The focus is on providing straightforward
introductions to Flink's
+APIs for managing state and time, with the expectation that having mastered
these fundamentals,
+you'll be much better equipped to pick up the rest of what you need to know
from the more detailed
+reference documentation. The links at the end of each page will lead you to
where you can learn
+more.
+
+Specifically, you will learn:
+
+- how to implement streaming data processing pipelines
+- how and why Flink manages state
+- how to use event time to consistently compute accurate analytics
+- how to build event-driven applications on continuous streams
+- how Flink is able to provide fault-tolerant, stateful stream processing with
exactly-once semantics
+
+These tutorials focus on four critical concepts: continuous processing of
streaming data, event
+time, stateful stream processing, and state snapshots. This page introduces
these concepts.
+
+{% info Note %} Accompanying these tutorials is a set of hands-on exercises
that will guide you
+through working with the concepts being presented.
+
+{% top %}
+
+## Stream Processing
+
+Streams are data's natural habitat. Whether it's events from web servers,
trades from a stock
+exchange, or sensor readings from a machine on a factory floor, data is
created as part of a stream.
+But when you analyze data, you can organize your processing around either
_bounded_ or _unbounded_
+streams, and which of these paradigms you choose has profound consequences.
+
+<img src="{{ site.baseurl }}/fig/bounded-unbounded.png" alt="Bounded and
unbounded streams" class="offset" width="90%" />
+
+**Batch processing** is the paradigm at work when you process a bounded data
stream. In this mode of
+operation you can choose to ingest the entire dataset before producing any
results, which means that
+it's possible, for example, to sort the data, compute global statistics, or
produce a final report
+that summarizes all of the input.
+
+**Stream processing**, on the other hand, involves unbounded data streams.
Conceptually, at least,
+the input may never end, and so you are forced to continuously process the
data as it arrives.
+
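The practical difference between the two paradigms can be shown with a small plain-Python illustration (not Flink code): a batch job can see the whole bounded input before answering, while a streaming job must emit results incrementally.

```python
# Plain-Python illustration (not Flink code) of the two paradigms:
# a batch job can ingest the entire bounded input before producing a
# result, while a streaming job must emit results incrementally.

data = [5, 3, 8, 1]

# Batch: the whole dataset is available, so global operations like
# sorting or computing a final maximum are possible.
batch_max = max(data)

# Streaming: events arrive one at a time, so only a running result
# can be maintained; there is never a final, global answer.
def running_max(stream):
    current = None
    for event in stream:
        current = event if current is None else max(current, event)
        yield current

stream_results = list(running_max(iter(data)))
```

The streaming version produces an answer after every event, and its last emitted value only happens to be final because this particular input is bounded.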
+In Flink, applications are composed of **streaming dataflows** that may be
transformed by
+user-defined **operators**. These dataflows form directed graphs that start
with one or more
+**sources**, and end in one or more **sinks**.
+
+<img src="{{ site.baseurl }}/fig/program_dataflow.svg" alt="A DataStream
program, and its dataflow." class="offset" width="80%" />
Review comment:
This code example should be updated:
* not using `keyBy("id")` based on a string - bad practice and may also be
removed in the future
* not using the `BucketingSink` (this line also has a rendering error)
##########
File path: docs/tutorials/index.md
##########
@@ -0,0 +1,186 @@
+
+Often there is a one-to-one correspondence between the transformations in the
programs and the
+operators in the dataflow. Sometimes, however, one transformation may consist
of multiple operators.
+
+An application may consume real-time data from streaming sources such as
message queues or
+distributed logs (for example, Apache Kafka or Kinesis). But Flink can also consume
bounded, historic data
+from a variety of data sources. Similarly, the streams of results being
produced by a Flink
+application can be sent to a wide variety of systems, and the state held
within Flink can be
+accessed via a REST API.
+
+<img src="{{ site.baseurl }}/fig/flink-application-sources-sinks.png"
alt="Flink application with sources and sinks" class="offset" width="90%" />
+
+### Parallel Dataflows
+
+Programs in Flink are inherently parallel and distributed. During execution, a
+*stream* has one or more **stream partitions**, and each *operator* has one or
+more **operator subtasks**. The operator subtasks are independent of one
Review comment:
the new term for "subtask" is "task", but, according to the
[glossary](https://ci.apache.org/projects/flink/flink-docs-master/concepts/glossary.html#sub-task),
a "Sub-Task" is the same but "emphasizes that there are multiple parallel
Tasks for the same Operator or Operator Chain". I think, we should use "Task"
here (not sure about capitalization).
##########
File path: docs/tutorials/index.md
##########
@@ -0,0 +1,186 @@
+### Parallel Dataflows
+
+Programs in Flink are inherently parallel and distributed. During execution, a
+*stream* has one or more **stream partitions**, and each *operator* has one or
+more **operator subtasks**. The operator subtasks are independent of one
+another, and execute in different threads and possibly on different machines or
+containers.
+
+The number of operator subtasks is the **parallelism** of that particular
+operator. The parallelism of a stream is always that of its producing operator.
+Different operators of the same program may have different levels of
+parallelism.
+
+<img src="{{ site.baseurl }}/fig/parallel_dataflow.svg" alt="A parallel
dataflow" class="offset" width="80%" />
+
+Streams can transport data between two operators in a *one-to-one* (or
+*forwarding*) pattern, or in a *redistributing* pattern:
+
+ - **One-to-one** streams (for example between the *Source* and the *map()*
+ operators in the figure above) preserve the partitioning and ordering of
+ the elements. That means that subtask[1] of the *map()* operator will see
+ the same elements in the same order as they were produced by subtask[1] of
+ the *Source* operator.
+
+ - **Redistributing** streams (as between *map()* and *keyBy/window* above, as
+ well as between *keyBy/window* and *Sink*) change the partitioning of
+ streams. Each *operator subtask* sends data to different target subtasks,
+ depending on the selected transformation. Examples are *keyBy()* (which
+ re-partitions by hashing the key), *broadcast()*, or *rebalance()* (which
+ re-partitions randomly). In a *redistributing* exchange the ordering among
+ the elements is only preserved within each pair of sending and receiving
+ subtasks (for example, subtask[1] of *map()* and subtask[2] of
+ *keyBy/window*). So, for example, the redistribution between the
keyBy/window and
+ the Sink operators shown above introduces non-determinism regarding the
+ order in which the aggregated results for different keys arrive at the
Sink.
+
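The redistributing pattern can be sketched as a toy model in plain Python (illustrative only, not Flink internals): a *keyBy()*-style exchange hashes each event's key to pick the downstream subtask, so all events for one key go to the same place, and only the per-key order is preserved.

```python
# Toy sketch of a redistributing exchange (illustrative, not Flink
# internals): keyBy() re-partitions by hashing the key, so all events
# with the same key go to the same downstream subtask, and only the
# per-key order is preserved.

def key_by(events, key_fn, parallelism):
    """Assign each event to a downstream subtask queue by key hash."""
    queues = [[] for _ in range(parallelism)]
    for event in events:
        queues[hash(key_fn(event)) % parallelism].append(event)
    return queues

events = [("user1", 10), ("user2", 20), ("user1", 30), ("user3", 40)]
queues = key_by(events, key_fn=lambda e: e[0], parallelism=2)

# All events for "user1" land in one queue, in their original order;
# how different keys interleave downstream is not deterministic.
user1_events = [e for q in queues for e in q if e[0] == "user1"]
```

This is why results for different keys may arrive at the sink in a non-deterministic order, while each individual key's events stay in sequence.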
+{% top %}
+
+## Timely Stream Processing
+
+For most streaming applications it is very valuable to be able to re-process
historic data with the
+same code that is used to process live data -- and to produce deterministic,
consistent results,
+regardless.
+
+It can also be crucial to pay attention to the order in which events occurred,
rather than the order
+in which they are delivered for processing, and to be able to reason about
when a set of events is
+(or should be) complete. For example, consider the set of events involved in
an e-commerce
+transaction, or financial trade.
+
+These requirements for timely stream processing can be met by using event time
timestamps that are
+recorded in the data stream, rather than using the clocks of the machines
processing the data.
+
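A minimal sketch in plain Python (not Flink code) shows why embedded event-time timestamps make results deterministic: events are bucketed into one-minute windows by the timestamp recorded in the event itself, so out-of-order delivery does not change the outcome.

```python
# Minimal sketch (plain Python, not Flink code) of event-time
# processing: events are bucketed into one-minute windows using the
# timestamp recorded in the event itself, so out-of-order delivery
# does not change the result.

from collections import defaultdict

WINDOW_MS = 60_000  # one-minute tumbling windows

def count_per_window(events):
    """events: iterable of (event_time_ms, payload); counts per window start."""
    counts = defaultdict(int)
    for ts, _ in events:
        counts[(ts // WINDOW_MS) * WINDOW_MS] += 1
    return dict(counts)

# The same three events, delivered in two different orders...
in_order = [(1_000, "a"), (2_000, "b"), (61_000, "c")]
shuffled = [(61_000, "c"), (1_000, "a"), (2_000, "b")]

# ...yield identical, deterministic per-window counts.
deterministic = count_per_window(in_order) == count_per_window(shuffled)
```

Had the windows been assigned by the processing machine's clock instead, the counts would depend on delivery timing and could not be reproduced when replaying historic data.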
+{% top %}
+
+## Stateful Stream Processing
+
+Flink's operations can be stateful. This means that how one event is handled
can depend on the
+accumulated effect of all the events that came before it. State may be used
for something simple,
+such as counting events per minute to display on a dashboard, or for something
more complex, such as
+computing features for a fraud detection model.
+
+A Flink application is run in parallel on a distributed cluster. The various
parallel instances of a
+given operator will execute independently, in separate threads, and in general
will be running on
+different machines.
+
+The set of parallel instances of a stateful operator is effectively a sharded
key-value store. Each
+parallel instance is responsible for handling events for a specific group of
keys, and the state for
+those keys is kept locally.
+
+The diagram below shows a job running with a parallelism of two across the
first three operators in
+the job graph, terminating in a sink that has a parallelism of one. The third
operator is stateful,
+and you can see that a fully connected network shuffle is occurring between
the second and third
Review comment:
?
```suggestion
and you can see that a fully-connected network shuffle is occurring between
the second and third
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]