slinkydeveloper commented on a change in pull request #19107:
URL: https://github.com/apache/flink/pull/19107#discussion_r827936778
##########
File path: docs/content/docs/concepts/glossary.md
##########
@@ -25,182 +25,605 @@ under the License.
# Glossary
+#### Aggregation
+
+Aggregation is an operation that takes multiple values and returns a single
value. When working with
+streams, it generally makes more sense to think in terms of aggregations over
finite windows, rather
+than over the entire stream.
+
+#### (Flink) Application
+
+A Flink application is any user program that submits one or multiple [Flink
Jobs](#flink-job) from its
+`main()` method. The execution of these jobs can happen in a local JVM or on a
remote setup of clusters
+with multiple machines.
+
+The jobs of an application can either be submitted to a long-running [Session
Cluster](#session-cluster),
+to a dedicated [Application Cluster](#application-cluster), or to a [Job
Cluster](#job-cluster).
+
+#### Application Cluster
+
+A Flink application cluster is a dedicated [Flink cluster](#(flink)-cluster)
that only executes
+[Flink jobs](#flink-job) from one [Flink application](#(flink)-application).
The lifetime of the Flink
+cluster is bound to the lifetime of the Flink application.
+
+#### Asynchronous Snapshotting
+
+A form of [snapshotting](#snapshot) that doesn't impede the ongoing stream
processing by allowing an
+operator to continue processing while it stores its state snapshot,
effectively letting the state
+snapshots happen asynchronously in the background.
+
+#### At-least-once
+
+A fault-tolerance guarantee and data delivery approach where multiple attempts
are made at delivering
+an event such that at least one succeeds. This guarantees that nothing is
lost, but you may experience
+duplicated results.
+
+#### At-most-once
+
+A data delivery approach where each event is delivered zero or one times.
There is lower latency but
+events may be lost.
+
+#### Backpressure
+
+A situation where a system is receiving data at a higher rate than it can
process during a temporary
+load spike.
+
+#### Barrier Alignment
+
+For providing exactly-once guarantees, Flink aligns the streams at operators
that receive multiple
+input streams, so that the snapshot will reflect the state resulting from
consuming events from both
+input streams up to (but not past) both barriers.
+
+#### Batch Processing
+
+This is the processing and analysis on a set of data that have already been
stored over a period
+of time (i.e. in groups or batches). The results are usually not available in
real-time. Flink
+executes batch programs as a special case of streaming programs.
+
+#### Bounded Streams
+
+Bounded [DataStreams](#datastream) have a defined start and end. They can be
processed by ingesting
+all data before performing any computations. Ordered ingestion is not required
to process bounded streams
+because a bounded data set can always be sorted. Processing of bounded streams
is also known as
+[batch processing](#batch-processing).
+
+#### Checkpoint
+
+A [snapshot](#snapshot) taken automatically by Flink for the purpose of being
able to recover from
+faults. A checkpoint marks a specific point in each of the input streams along
with the corresponding
+state for each of the operators. Checkpoints can be incremental and unaligned,
and are optimized for
+being restored quickly.
+
+#### Checkpoint Barrier
+
+A special marker that flows along the graph and triggers the checkpointing
process on each of the
+parallel instances of the operators. Checkpoint barriers are injected into the
source operators and
+flow together with the data. If an operator has multiple outputs, it gets
"split" into both of them.
+
+#### Checkpoint Coordinator
+
+This coordinates the distributed snapshots of operators and state. It is part
of the JobManager and
+instructs the TaskManager when to begin a checkpoint by sending the messages
to the relevant tasks
+and collecting the checkpoint acknowledgements.
+
#### Checkpoint Storage
-The location where the [State Backend](#state-backend) will store its snapshot
during a checkpoint (Java Heap of [JobManager](#flink-jobmanager) or
Filesystem).
+The location where the [state backend](#state-backend) will store its snapshot
during a checkpoint.
+This could be on the Java heap of the [JobManager](#flink-jobmanager) or on a
file system.
+
+#### (Flink) Client
+
+This is not part of the runtime and program execution but is used to prepare
and send a dataflow graph
+to the JobManager. The Flink client runs either as part of the program that
triggers the execution or
+in the command line process via `./bin/flink run`.
+
+#### (Flink) Cluster
-#### Flink Application Cluster
+A distributed system consisting of (typically) one [JobManager](#jobmanager)
and one or more
+[TaskManager](#taskmanager) processes.
-A Flink Application Cluster is a dedicated [Flink Cluster](#flink-cluster) that
-only executes [Flink Jobs](#flink-job) from one [Flink
-Application](#flink-application). The lifetime of the [Flink
-Cluster](#flink-cluster) is bound to the lifetime of the Flink Application.
+#### Connected Streams
-#### Flink Job Cluster
+A pattern in Flink where a single operator has two input streams. Connected
streams can also be used
+to implement streaming joins.
-A Flink Job Cluster is a dedicated [Flink Cluster](#flink-cluster) that only
-executes a single [Flink Job](#flink-job). The lifetime of the
-[Flink Cluster](#flink-cluster) is bound to the lifetime of the Flink Job.
-This deployment mode has been deprecated since Flink 1.15.
+#### Connectors
-#### Flink Cluster
+Connectors allow [Flink applications](#(flink)-applications) to read from and
write to various external
+systems. They support multiple formats in order to encode and decode data to
match Flink’s data structures.
-A distributed system consisting of (typically) one
[JobManager](#flink-jobmanager) and one or more
-[Flink TaskManager](#flink-taskmanager) processes.
+#### Dataflow
+
+See [logical graph](#logical-graph).
+
+#### DataStream
+
+This is a collection of data in a Flink application. You can think of them as
immutable collections
+of data that can contain duplicates. This data can either be finite or
unbounded.
+
+#### Directed Acyclic Graph (DAG)
+
+This is a graph that is directed and without cycles connecting the other
edges. It can be used to
+conceptually represent a [dataflow](#dataflow) where you never look back to
previous events.
Review comment:
Maybe link to the logical graph entry directly, so the reader skips a link hop?
##########
File path: docs/content/docs/concepts/glossary.md
##########
@@ -25,182 +25,605 @@ under the License.
# Glossary
+
+#### Dispatcher
+
+This is a component of the [JobManager](#jobmanager) and provides a REST
interface to submit Flink
+applications for execution and starts a new [JobMaster](#jobmaster) for each
submitted job. It also
+runs the Flink web UI to provide information about job executions.
#### Event
-An event is a statement about a change of the state of the domain modelled by
the
-application. Events can be input and/or output of a stream or batch processing
application.
-Events are special types of [records](#Record).
+An event is a statement about a change of the state of the domain modelled by
the application. Events
+can be input and/or output of a stream processing application. Events are
special types of
+[records](#Record).
+
+#### Event Time
+
+The time when an [event](#event) occurred, as recorded by the device producing
(or storing) the event.
+For reproducible results, you should use event time because the result does
not depend on when the
+calculation is performed.
+
+If you want to use event time, you will also need to supply a Timestamp
Extractor and Watermark Generator
+that Flink will use to track the progress of event time.
Review comment:
I would modify this and make it a bit more generic across the APIs, something like this:
```suggestion
The various Flink APIs provide different ways to specify how to extract the
event time from the event instances, such as a TimestampExtractor for the
DataStream API and a timestamp column for the SQL/Table API.
```
And add the links, as @knaufk suggested.
##########
File path: docs/content/docs/concepts/glossary.md
##########
@@ -25,182 +25,605 @@ under the License.
# Glossary
+#### Event Time
+
+The time when an [event](#event) occurred, as recorded by the device producing
(or storing) the event.
+For reproducible results, you should use event time because the result does
not depend on when the
+calculation is performed.
Review comment:
This comes a bit out of the blue; in particular, "you should use" doesn't
really specify the context. Maybe reword to something like:
```suggestion
When developing streaming applications, it is good practice for the source
of the event to attach the event time to the event, so that the stream
processor can achieve reproducible results that do not depend on when the
calculation is performed.
```
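To make the reproducibility point concrete, here is a minimal plain-Java sketch. It does not use any Flink API; `EventTimeSketch`, `Event`, and `countPerWindow` are hypothetical names chosen for illustration, and timestamps are assumed to be milliseconds attached by the source. It shows that window counts derived only from event time come out the same whatever order (or time) the events are processed in.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class EventTimeSketch {

    /** Hypothetical event type: the source attaches the event time (in milliseconds). */
    public static final class Event {
        public final String key;
        public final long eventTimeMillis;

        public Event(String key, long eventTimeMillis) {
            this.key = key;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    /**
     * Counts events per tumbling window, keyed by window start.
     * Only the timestamp carried by the event is used, never the wall clock,
     * so the result does not depend on when or in which order this runs.
     */
    public static Map<Long, Integer> countPerWindow(List<Event> events, long windowSizeMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (Event e : events) {
            long windowStart = e.eventTimeMillis - Math.floorMod(e.eventTimeMillis, windowSizeMillis);
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> inOrder = List.of(
                new Event("a", 1_000L), new Event("b", 4_000L), new Event("c", 6_000L));

        // Simulate out-of-order arrival by reversing the list.
        List<Event> outOfOrder = new ArrayList<>(inOrder);
        Collections.reverse(outOfOrder);

        System.out.println(countPerWindow(inOrder, 5_000L));
        System.out.println(countPerWindow(outOfOrder, 5_000L));
    }
}
```

Both calls print the same map, which is the property the suggested wording is describing. A processing-time variant would instead read the wall clock inside the loop and could assign the same event to different windows on different runs.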
##########
File path: docs/content/docs/concepts/glossary.md
##########
@@ -25,182 +25,605 @@ under the License.
# Glossary
+
+#### Exactly-once
+
+A fault-tolerance guarantee and data delivery approach where nothing is lost
or duplicated. This does
+not mean that every event will be processed exactly once. Instead, it means
that every event will affect
+the state being managed by Flink exactly once.
#### ExecutionGraph
-see [Physical Graph](#physical-graph)
+See [Physical Graph](#physical-graph).
+
+#### Externalized Checkpoint
-#### Function
+A checkpoint that is configured to be retained instead of being deleted when a
job is cancelled.
+Flink normally retains only the n-most-recent checkpoints (n being
configurable) while a job is running
+and deletes them when a job is cancelled.
-Functions are implemented by the user and encapsulate the
-application logic of a Flink program. Most Functions are wrapped by a
corresponding
-[Operator](#operator).
+You can manually resume from an externalized checkpoint.
-#### Instance
+#### Format
-The term *instance* is used to describe a specific instance of a specific type
(usually
-[Operator](#operator) or [Function](#function)) during runtime. As Apache
Flink is mostly written in
-Java, this corresponds to the definition of *Instance* or *Object* in Java. In
the context of Apache
-Flink, the term *parallel instance* is also frequently used to emphasize that
multiple instances of
-the same [Operator](#operator) or [Function](#function) type are running in
parallel.
+A table format is a storage format that defines how to map binary data onto
table columns.
+Flink comes with a variety of built-in output formats that can be used with
table [connectors](#connector).
-#### Flink Application
+#### Ingestion Time
-A Flink application is a Java Application that submits one or multiple [Flink
-Jobs](#flink-job) from the `main()` method (or by some other means). Submitting
-jobs is usually done by calling `execute()` on an execution environment.
+A timestamp recorded by Flink at the moment it ingests the event.
-The jobs of an application can either be submitted to a long running [Flink
-Session Cluster](#flink-session-cluster), to a dedicated [Flink Application
-Cluster](#flink-application-cluster), or to a [Flink Job
-Cluster](#flink-job-cluster).
+#### (Flink) Job
-#### Flink Job
+This is the runtime representation of a [logical graph](#logical-graph) (also
often called dataflow
+graph) that is created and submitted by calling `execute()` in a [Flink
application](#flink-application).
-A Flink Job is the runtime representation of a [logical graph](#logical-graph)
-(also often called dataflow graph) that is created and submitted by calling
-`execute()` in a [Flink Application](#flink-application).
+#### Job Cluster
+
+This is a dedicated [Flink cluster](#(flink)-cluster) that only executes a
single [Flink job](#(flink)-job).
+The lifetime of the Flink cluster is bound to the lifetime of the Flink job.
This deployment mode has
+been deprecated since Flink 1.15.
#### JobGraph
-see [Logical Graph](#logical-graph)
+See [Logical Graph](#logical-graph).
+
+#### JobManager
-#### Flink JobManager
+The JobManager is the orchestrator of a [Flink cluster](#(flink)-cluster). It
contains three distinct
+components: ResourceManager, Dispatcher, and a [JobMaster](#jobmaster) per
running [Flink job](#(flink)-job).
-The JobManager is the orchestrator of a [Flink Cluster](#flink-cluster). It
contains three distinct
-components: Flink Resource Manager, Flink Dispatcher and one [Flink
JobMaster](#flink-jobmaster)
-per running [Flink Job](#flink-job).
+There is always at least one JobManager. A high-availability setup might have
multiple JobManagers,
+one of which is always the leader.
-#### Flink JobMaster
+#### JobMaster
-JobMasters are one of the components running in the
[JobManager](#flink-jobmanager). A JobMaster is
-responsible for supervising the execution of the [Tasks](#task) of a single
job.
+This is one of the components that run in the [JobManager](#jobmanager). It is
responsible for supervising
+the execution of the [tasks](#task) of a single [job](#(flink)-job). Multiple
jobs can run simultaneously
+in a [Flink cluster](#(flink)-cluster), each having its own JobMaster.
#### JobResultStore
-The JobResultStore is a Flink component that persists the results of globally
terminated
-(i.e. finished, cancelled or failed) jobs to a filesystem, allowing the
results to outlive
-a finished job. These results are then used by Flink to determine whether jobs
should
-be subject to recovery in highly-available clusters.
+The JobResultStore is a Flink component that persists the results of globally
terminated (i.e. finished,
+cancelled or failed) jobs to a filesystem, allowing the results to outlive a
finished job. These results
+are then used by Flink to determine whether jobs should be subject to recovery
in highly-available clusters.
+
+#### Key Group
+
+These are the atomic unit by which Flink can redistribute [keyed
state](#keyed-state). There are
+exactly as many key groups as the defined maximum parallelism. During
execution, each parallel instance
+of a keyed operator works with the keys for one or more key groups.
+
+#### Keyed State
+
+Keyed state is one of the two basic types of state in Apache Flink (the other
being operator state).
Review comment:
Link "operator state" to its glossary entry here, e.g. `[operator state](#operator-state)`.
##########
File path: docs/content/docs/concepts/glossary.md
##########
@@ -25,182 +25,605 @@ under the License.
+#### Key Group
+
+These are the atomic unit by which Flink can redistribute [keyed
state](#keyed-state). There are
Review comment:
I would add a sentence first like:
```suggestion
A disjoint subset of the universe of keys, for a given key type, spanning a
certain range of keys. Flink creates a [keyed state](#keyed-state) bucket for
each key group, always assigning the state of a specific key to a specific
bucket. There are
```
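To make the key-to-bucket mapping concrete, here is a minimal, self-contained sketch that could help when wording the entry. It is not Flink's code: the class and method names are hypothetical, and `Math.floorMod(key.hashCode(), maxParallelism)` is a simplified stand-in for the additional hashing Flink applies to a key's hash code. The owner formula `keyGroup * parallelism / maxParallelism` reflects my understanding of how contiguous key-group ranges are handed to parallel instances, so please treat it as an assumption.

```java
public class KeyGroupSketch {

    // Maps a key to a key group in [0, maxParallelism).
    // Assumption: plain floorMod stands in for the stronger hash Flink applies.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Maps a key group to the index of the parallel operator instance that owns it.
    // Integer division gives each instance a contiguous range of key groups.
    static int operatorIndexFor(int keyGroup, int parallelism, int maxParallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        // The key group of a key is fixed; only its owner changes with the parallelism.
        for (int parallelism : new int[] {2, 4}) {
            int keyGroup = keyGroupFor("user-42", maxParallelism);
            int owner = operatorIndexFor(keyGroup, parallelism, maxParallelism);
            System.out.printf("parallelism=%d keyGroup=%d owner=%d%n", parallelism, keyGroup, owner);
        }
    }
}
```

In this sketch a key always lands in the same key group, whatever the parallelism, and only the owning instance changes. That is the property behind "atomic unit by which Flink can redistribute keyed state": on rescaling, state moves in whole key groups rather than key by key.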
##########
File path: docs/content/docs/concepts/glossary.md
##########
@@ -25,182 +25,605 @@ under the License.
+#### Keyed State
+
+Keyed state is one of the two basic types of state in Apache Flink (the other
being operator state).
+In order to have all events with the same value of an attribute grouped
together, you can partition
+a stream around that attribute, and maintain it as an embedded key/value
store. This results in a keyed
+state.
+
+A keyed state is always bound to keys and is only available to functions and
operators that process
+data from a keyed stream.
+
+Flink supports several different types of keyed state, with the simplest one
being [ValueState](#valuestate).
+
+#### Keyed Stream
+
+A keyed stream is a [DataStream](#DataStream) on which [operator
state](#operator-state) is partitioned
+by a key. Typical operations supported by a DataStream are also possible on a
keyed stream, except for
+partitioning methods such as shuffle, forward, and keyBy.
+
+#### Lateness
+
+Lateness is defined relative to the [watermarks](#watermark). A watermark(t)
asserts that the stream
+is complete up through to time t. Any event is considered late if it comes
after the watermark whose
+timestamp is ≤ t.
+
+#### ListState<T>
+
+This is a type of [keyed state](#keyed-state) that keeps a list of elements.
You can append elements
+and retrieve an Iterable over all currently stored elements. Elements are
added using add(T) or
+addAll(List<T>). The Iterable can be retrieved using Iterable<T> get().
#### Logical Graph
-A logical graph is a directed graph where the nodes are [Operators](#operator)
-and the edges define input/output-relationships of the operators and correspond
-to data streams or data sets. A logical graph is created by submitting jobs
-from a [Flink Application](#flink-application).
+This is a directed graph where the nodes are [operators](#operator) and the
edges define input/output
+relationships of the operators and correspond to [DataStreams](#datastreams).
A logical graph is created
+by submitting jobs to a [Flink cluster](#(flink)-cluster) from a [Flink
application](#(flink)-application).
-Logical graphs are also often referred to as *dataflow graphs*.
+Logical graphs are also often referred to as [dataflow](#dataflow).
#### Managed State
-Managed State describes application state which has been registered with the
framework. For
-Managed State, Apache Flink will take care about persistence and rescaling
among other things.
+Managed state is application state which has been registered with the stream
processing framework,
+which will take care of the persistence and rescaling of this state.
+
+This type of state is represented in data structures controlled by the Flink
runtime, such as internal
+hash tables, or RocksDB. Flink’s runtime encodes the states and writes them
into the checkpoints.
+
+[Keyed state](#keyed-state) and [operator state](#operator-state) exist in two
forms: managed and [raw](#raw-state).
+
+#### MapState<UK, UV>
+
+This is a type of [keyed state](#keyed-state) that keeps a list of mappings.
You can put key-value
+pairs into the state and retrieve an Iterable over all currently stored
mappings. Mappings are added
+using put(UK, UV) or putAll(Map<UK, UV>). The value associated with a key can
be retrieved using get(UK).
+
+#### Non-keyed State
+
+This type of state is bound to one parallel operator instance and is also
called [operator state](#operator-state).
+
+It is possible to work with [managed state](#managed-state) in non-keyed
contexts but it is unusual
+for user-defined functions to need non-keyed state and the interfaces involved
would be different.
+
+This feature is most often used in the implementation of [sources](#source)
and [sinks](#sink).
+
+#### Offset
+
+A number identifying how far you are from the beginning of a certain
[DataStream](#datastream).
#### Operator
-Node of a [Logical Graph](#logical-graph). An Operator performs a certain
operation, which is
-usually executed by a [Function](#function). Sources and Sinks are special
Operators for data
+An operator is a node of a [logical graph](#logical-graph). An operator
performs a certain operation,
+which is usually executed by a [function](#function). Sources and sinks are
special operators for data
ingestion and data egress.
#### Operator Chain
-An Operator Chain consists of two or more consecutive [Operators](#operator)
without any
-repartitioning in between. Operators within the same Operator Chain forward
records to each other
-directly without going through serialization or Flink's network stack.
+An operator chain consists of two or more consecutive [operators](#operator)
without any
+repartitioning in between. Operators within the same operator chain forward
records to each other
+directly without going through serialization or Flink's network stack. This is
a useful optimization
+and increases overall throughput while decreasing latency. The chaining
behavior can be configured.
+
+#### Operator State
+
+See [non-keyed state](#non-keyed-state).
+
+#### Parallelism
+
+This is a technique for making programs run faster by performing several
computations simultaneously.
#### Partition
-A partition is an independent subset of the overall data stream or data set. A
data stream or
-data set is divided into partitions by assigning each [record](#Record) to one
or more partitions.
-Partitions of data streams or data sets are consumed by [Tasks](#task) during
runtime. A
-transformation which changes the way a data stream or data set is partitioned
is often called
-repartitioning.
+A partition is an independent subset of the overall [DataStream](#datastream).
A DataStream is divided
+into partitions by assigning each [record](#record) to one or more partitions
via keys. Partitions of
+DataStreams are consumed by [tasks](#task) during runtime. A transformation
that changes the way a
+DataStream is partitioned is often called repartitioning.
#### Physical Graph
-A physical graph is the result of translating a [Logical
Graph](#logical-graph) for execution in a
-distributed runtime. The nodes are [Tasks](#task) and the edges indicate
input/output-relationships
-or [partitions](#partition) of data streams or data sets.
+A physical graph is the result of translating a [logical
graph](#logical-graph) for execution in a
+distributed runtime. The nodes are [tasks](#task) and the edges indicate
input/output relationships
+or [partitions](#partition) of DataStreams.
+
+#### POJO
+
+This is a composite data type and can be serialized with Flink's serializer.
Flink recognizes a data
+type as a POJO type (and allows “by-name” field referencing) if the following
conditions are met:
+
+- the class is public and standalone (no non-static inner class)
+- the class has a public no-argument constructor
+- all non-static, non-transient fields in the class (and all superclasses) are
either public (and
+ non-final) or have public getter- and setter- methods that follow the Java
naming conventions for
+ getters and setters
+
+Flink analyzes the structure of POJO types and can process POJOs more
efficiently than general types.
+
+#### Process Functions
+
+This type of function combines event processing with timers and state and is
the basis for creating
+event-driven applications with Flink.
+
+#### Processing Time
+
+The time when a specific operator in your pipeline is processing the event.
Computing analytics based
+on processing time can cause inconsistencies and make it difficult to
re-analyze historic data or test
+new implementations.
+
+#### Queryable State
+
+This is managed keyed (partitioned) state that can be accessed from outside of
Flink during runtime.
+
+#### Raw State
+
+This is state that operators keep in their own data structures. When
checkpointed, only a sequence of
+bytes is written into the checkpoint and Flink knows nothing about the state’s
data structures and will
+see only the raw bytes.
+
+[Keyed state](#keyed-state) and [operator state](#operator-state) exist in two
forms: [managed](#managed-state) and raw.
#### Record
-Records are the constituent elements of a data set or data stream.
[Operators](#operator) and
-[Functions](#Function) receive records as input and emit records as output.
+Records are the elements that make up a [DataStream](#datastream).
[Operators](#operator) and [functions](#function)
+receive records as input and emit records as output.
+
+#### ResourceManager
+
+This is part of the [JobManager](#JobManager) and is responsible for resource
de-/allocation and
+provisioning in a Flink cluster.
+
+#### Rich Functions
+
+A RichFunction is a "rich" variant of Flink's function interfaces for data
transformation. These functions
+have some additional methods needed for working with managed keyed state such
as `open(Configuration c)`,
+`close()`, `getRuntimeContext()`.
+
+#### Rolling Total
+
+The sum of a sequence of numbers which is updated each time a new number is
added to the sequence,
+by adding the value of the new number to the previous rolling total.
#### (Runtime) Execution Mode
-DataStream API programs can be executed in one of two execution modes: `BATCH`
-or `STREAMING`. See [Execution Mode]({{< ref
"/docs/dev/datastream/execution_mode" >}}) for more details.
+DataStream API programs can be executed in one of two execution modes: `BATCH`
or `STREAMING`.
+See [Execution Mode]({{< ref "/docs/dev/datastream/execution_mode" >}}) for
more details.
+
+#### Savepoint
+
+A [snapshot](#snapshot) triggered manually by a user (or an API call) for some
operational purpose,
+such as a stateful redeploy/upgrade/rescaling. Savepoints are always complete
and aligned and are
+optimized for operational flexibility.
+
+#### Scalar
+
+A scalar refers to a single value. This is in contrast to a set of values.
+
+#### Schema
+
+This refers to the organization or structure of data as a blueprint.
Review comment:
I would remove this one. This is a well-known general computing concept, and
there is nothing Flink-specific to add here.
##########
File path: docs/content/docs/concepts/glossary.md
##########
@@ -25,182 +25,605 @@ under the License.
# Glossary
+#### Aggregation
+
+Aggregation is an operation that takes multiple values and returns a single
value. When working with
+streams, it generally makes more sense to think in terms of aggregations over
finite windows, rather
+than over the entire stream.
+
+#### (Flink) Application
+
+A Flink application is any user program that submits one or multiple [Flink
Jobs](#flink-job) from its
+`main()` method. The execution of these jobs can happen in a local JVM or on a
remote setup of clusters
+with multiple machines.
+
+The jobs of an application can either be submitted to a long-running [Session
Cluster](#session-cluster),
+to a dedicated [Application Cluster](#application-cluster), or to a [Job
Cluster](#job-cluster).
+
+#### Application Cluster
+
+A Flink application cluster is a dedicated [Flink cluster](#(flink)-cluster)
that only executes
+[Flink jobs](#flink-job) from one [Flink application](#(flink)-application).
The lifetime of the Flink
+cluster is bound to the lifetime of the Flink application.
+
+#### Asynchronous Snapshotting
+
+A form of [snapshotting](#snapshot) that doesn't impede the ongoing stream
processing by allowing an
+operator to continue processing while it stores its state snapshot,
effectively letting the state
+snapshots happen asynchronously in the background.
+
+#### At-least-once
+
+A fault-tolerance guarantee and data delivery approach where multiple attempts
are made at delivering
+an event such that at least one succeeds. This guarantees that nothing is
lost, but you may experience
+duplicated results.
+
+#### At-most-once
+
+A data delivery approach where each event is delivered zero or one times.
There is lower latency but
+events may be lost.
+
+#### Backpressure
+
+A situation where a system is receiving data at a higher rate than it can
process during a temporary
+load spike.
+
+#### Barrier Alignment
+
+For providing exactly-once guarantees, Flink aligns the streams at operators
that receive multiple
+input streams, so that the snapshot will reflect the state resulting from
consuming events from both
+input streams up to (but not past) both barriers.
+
+#### Batch Processing
+
+This is the processing and analysis on a set of data that have already been
stored over a period
+of time (i.e. in groups or batches). The results are usually not available in
real-time. Flink
+executes batch programs as a special case of streaming programs.
+
+#### Bounded Streams
+
+Bounded [DataStreams](#datastream) have a defined start and end. They can be
processed by ingesting
+all data before performing any computations. Ordered ingestion is not required
to process bounded streams
+because a bounded data set can always be sorted. Processing of bounded streams
is also known as
+[batch processing](#batch-processing).
+
+#### Checkpoint
+
+A [snapshot](#snapshot) taken automatically by Flink for the purpose of being
able to recover from
+faults. A checkpoint marks a specific point in each of the input streams along
with the corresponding
+state for each of the operators. Checkpoints can be incremental and unaligned,
and are optimized for
+being restored quickly.
+
+#### Checkpoint Barrier
+
+A special marker that flows along the graph and triggers the checkpointing
process on each of the
+parallel instances of the operators. Checkpoint barriers are injected into the
source operators and
+flow together with the data. If an operator has multiple outputs, it gets
"split" into both of them.
+
+#### Checkpoint Coordinator
+
+This coordinates the distributed snapshots of operators and state. It is part
of the JobManager and
+instructs the TaskManager when to begin a checkpoint by sending the messages
to the relevant tasks
+and collecting the checkpoint acknowledgements.
+
#### Checkpoint Storage
-The location where the [State Backend](#state-backend) will store its snapshot
during a checkpoint (Java Heap of [JobManager](#flink-jobmanager) or
Filesystem).
+The location where the [state backend](#state-backend) will store its snapshot
during a checkpoint.
+This could be on the Java heap of the [JobManager](#flink-jobmanager) or on a
file system.
+
+#### (Flink) Client
+
+This is not part of the runtime and program execution but is used to prepare
and send a dataflow graph
+to the JobManager. The Flink client runs either as part of the program that
triggers the execution or
+in the command line process via `./bin/flink run`.
+
+#### (Flink) Cluster
-#### Flink Application Cluster
+A distributed system consisting of (typically) one [JobManager](#jobmanager)
and one or more
+[TaskManager](#taskmanager) processes.
-A Flink Application Cluster is a dedicated [Flink Cluster](#flink-cluster) that
-only executes [Flink Jobs](#flink-job) from one [Flink
-Application](#flink-application). The lifetime of the [Flink
-Cluster](#flink-cluster) is bound to the lifetime of the Flink Application.
+#### Connected Streams
-#### Flink Job Cluster
+A pattern in Flink where a single operator has two input streams. Connected
streams can also be used
+to implement streaming joins.
-A Flink Job Cluster is a dedicated [Flink Cluster](#flink-cluster) that only
-executes a single [Flink Job](#flink-job). The lifetime of the
-[Flink Cluster](#flink-cluster) is bound to the lifetime of the Flink Job.
-This deployment mode has been deprecated since Flink 1.15.
+#### Connectors
-#### Flink Cluster
+Connectors allow [Flink applications](#(flink)-applications) to read from and
write to various external
+systems. They support multiple formats in order to encode and decode data to
match Flink’s data structures.
-A distributed system consisting of (typically) one
[JobManager](#flink-jobmanager) and one or more
-[Flink TaskManager](#flink-taskmanager) processes.
+#### Dataflow
+
+See [logical graph](#logical-graph).
+
+#### DataStream
+
+This is a collection of data in a Flink application. You can think of them as
immutable collections
+of data that can contain duplicates. This data can either be finite or
unbounded.
+
+#### Directed Acyclic Graph (DAG)
+
+This is a graph that is directed and without cycles connecting the other
edges. It can be used to
+conceptually represent a [dataflow](#dataflow) where you never look back to
previous events.
+
+#### Dispatcher
+
+This is a component of the [JobManager](#jobmanager) and provides a REST
interface to submit Flink
+applications for execution and starts a new [JobMaster](#jobmaster) for each
submitted job. It also
+runs the Flink web UI to provide information about job executions.
#### Event
-An event is a statement about a change of the state of the domain modelled by
the
-application. Events can be input and/or output of a stream or batch processing
application.
-Events are special types of [records](#Record).
+An event is a statement about a change of the state of the domain modelled by
the application. Events
+can be input and/or output of a stream processing application. Events are
special types of
+[records](#Record).
+
+#### Event Time
+
+The time when an [event](#event) occurred, as recorded by the device producing
(or storing) the event.
+For reproducible results, you should use event time because the result does
not depend on when the
+calculation is performed.
+
+If you want to use event time, you will also need to supply a Timestamp
Extractor and Watermark Generator
+that Flink will use to track the progress of event time.
+
+#### Exactly-once
+
+A fault-tolerance guarantee and data delivery approach where nothing is lost
or duplicated. This does
+not mean that every event will be processed exactly once. Instead, it means
that every event will affect
+the state being managed by Flink exactly once.
#### ExecutionGraph
-see [Physical Graph](#physical-graph)
+See [Physical Graph](#physical-graph).
+
+#### Externalized Checkpoint
-#### Function
+A checkpoint that is configured to be retained instead of being deleted when a
job is cancelled.
+Flink normally retains only the n-most-recent checkpoints (n being
configurable) while a job is running
+and deletes them when a job is cancelled.
-Functions are implemented by the user and encapsulate the
-application logic of a Flink program. Most Functions are wrapped by a
corresponding
-[Operator](#operator).
+You can manually resume from an externalized checkpoint.
-#### Instance
+#### Format
-The term *instance* is used to describe a specific instance of a specific type
(usually
-[Operator](#operator) or [Function](#function)) during runtime. As Apache
Flink is mostly written in
-Java, this corresponds to the definition of *Instance* or *Object* in Java. In
the context of Apache
-Flink, the term *parallel instance* is also frequently used to emphasize that
multiple instances of
-the same [Operator](#operator) or [Function](#function) type are running in
parallel.
+A table format is a storage format that defines how to map binary data onto
table columns.
+Flink comes with a variety of built-in output formats that can be used with
table [connectors](#connector).
-#### Flink Application
+#### Ingestion Time
-A Flink application is a Java Application that submits one or multiple [Flink
-Jobs](#flink-job) from the `main()` method (or by some other means). Submitting
-jobs is usually done by calling `execute()` on an execution environment.
+A timestamp recorded by Flink at the moment it ingests the event.
-The jobs of an application can either be submitted to a long running [Flink
-Session Cluster](#flink-session-cluster), to a dedicated [Flink Application
-Cluster](#flink-application-cluster), or to a [Flink Job
-Cluster](#flink-job-cluster).
+#### (Flink) Job
-#### Flink Job
+This is the runtime representation of a [logical graph](#logical-graph) (also
often called dataflow
+graph) that is created and submitted by calling `execute()` in a [Flink
application](#flink-application).
-A Flink Job is the runtime representation of a [logical graph](#logical-graph)
-(also often called dataflow graph) that is created and submitted by calling
-`execute()` in a [Flink Application](#flink-application).
+#### Job Cluster
+
+This is a dedicated [Flink cluster](#(flink)-cluster) that only executes a
single [Flink job](#(flink)-job).
+The lifetime of the Flink cluster is bound to the lifetime of the Flink job.
This deployment mode has
+been deprecated since Flink 1.15.
#### JobGraph
-see [Logical Graph](#logical-graph)
+See [Logical Graph](#logical-graph).
+
+#### JobManager
-#### Flink JobManager
+The JobManager is the orchestrator of a [Flink cluster](#(flink)-cluster). It
contains three distinct
+components: ResourceManager, Dispatcher, and a [JobMaster](#jobmaster) per
running [Flink job](#(flink)-job).
-The JobManager is the orchestrator of a [Flink Cluster](#flink-cluster). It
contains three distinct
-components: Flink Resource Manager, Flink Dispatcher and one [Flink
JobMaster](#flink-jobmaster)
-per running [Flink Job](#flink-job).
+There is always at least one JobManager. A high-availability setup might have
multiple JobManagers,
+one of which is always the leader.
-#### Flink JobMaster
+#### JobMaster
-JobMasters are one of the components running in the
[JobManager](#flink-jobmanager). A JobMaster is
-responsible for supervising the execution of the [Tasks](#task) of a single
job.
+This is one of the components that run in the [JobManager](#jobmanager). It is
responsible for supervising
+the execution of the [tasks](#task) of a single [job](#(flink)-job). Multiple
jobs can run simultaneously
+in a [Flink cluster](#(flink)-cluster), each having its own JobMaster.
#### JobResultStore
-The JobResultStore is a Flink component that persists the results of globally
terminated
-(i.e. finished, cancelled or failed) jobs to a filesystem, allowing the
results to outlive
-a finished job. These results are then used by Flink to determine whether jobs
should
-be subject to recovery in highly-available clusters.
+The JobResultStore is a Flink component that persists the results of globally
terminated (i.e. finished,
+cancelled or failed) jobs to a filesystem, allowing the results to outlive a
finished job. These results
+are then used by Flink to determine whether jobs should be subject to recovery
in highly-available clusters.
+
+#### Key Group
+
+These are the atomic unit by which Flink can redistribute [keyed
state](#keyed-state). There are
+exactly as many key groups as the defined maximum parallelism. During
execution, each parallel instance
+of a keyed operator works with the keys for one or more key groups.
+
+#### Keyed State
+
+Keyed state is one of the two basic types of state in Apache Flink (the other
being operator state).
+In order to have all events with the same value of an attribute grouped
together, you can partition
+a stream around that attribute, and maintain it as an embedded key/value
store. This results in a keyed
+state.
+
+A keyed state is always bound to keys and is only available to functions and
operators that process
+data from a keyed stream.
+
+Flink supports several different types of keyed state, with the simplest one
being [ValueState](#valuestate).
+
+#### Keyed Stream
+
+A keyed stream is a [DataStream](#DataStream) on which [operator
state](#operator-state) is partitioned
+by a key. Typical operations supported by a DataStream are also possible on a
keyed stream, except for
+partitioning methods such as shuffle, forward, and keyBy.
+
+#### Lateness
+
+Lateness is defined relative to the [watermarks](#watermark). A watermark(t)
asserts that the stream
+is complete up through to time t. Any event is considered late if it comes
after the watermark whose
+timestamp is ≤ t.
+
+#### ListState<T>
+
+This is a type of [keyed state](#keyed-state) that keeps a list of elements.
You can append elements
+and retrieve an Iterable over all currently stored elements. Elements are
added using add(T) or
+addAll(List<T>). The Iterable can be retrieved using Iterable<T> get().
#### Logical Graph
-A logical graph is a directed graph where the nodes are [Operators](#operator)
-and the edges define input/output-relationships of the operators and correspond
-to data streams or data sets. A logical graph is created by submitting jobs
-from a [Flink Application](#flink-application).
+This is a directed graph where the nodes are [operators](#operator) and the
edges define input/output
+relationships of the operators and correspond to [DataStreams](#datastreams).
A logical graph is created
+by submitting jobs to a [Flink cluster](#(flink)-cluster) from a [Flink
application](#(flink)-application).
-Logical graphs are also often referred to as *dataflow graphs*.
+Logical graphs are also often referred to as [dataflow](#dataflow).
#### Managed State
-Managed State describes application state which has been registered with the
framework. For
-Managed State, Apache Flink will take care about persistence and rescaling
among other things.
+Managed state is application state which has been registered with the stream
processing framework,
+which will take care of the persistence and rescaling of this state.
+
+This type of state is represented in data structures controlled by the Flink
runtime, such as internal
+hash tables, or RocksDB. Flink’s runtime encodes the states and writes them
into the checkpoints.
+
+[Keyed state](#keyed-state) and [operator state](#operator-state) exist in two
forms: managed and [raw](#raw-state).
+
+#### MapState<UK, UV>
+
+This is a type of [keyed state](#keyed-state) that keeps a list of mappings.
You can put key-value
+pairs into the state and retrieve an Iterable over all currently stored
mappings. Mappings are added
+using put(UK, UV) or putAll(Map<UK, UV>). The value associated with a key can
be retrieved using get(UK).
+
+#### Non-keyed State
+
+This type of state is bound to one parallel operator instance and is also
called [operator state](#operator-state).
+
+It is possible to work with [managed state](#managed-state) in non-keyed
contexts but it is unusual
+for user-defined functions to need non-keyed state and the interfaces involved
would be different.
+
+This feature is most often used in the implementation of [sources](#source)
and [sinks](#sink).
+
+#### Offset
+
+A number identifying how far you are from the beginning of a certain
[DataStream](#datastream).
#### Operator
-Node of a [Logical Graph](#logical-graph). An Operator performs a certain
operation, which is
-usually executed by a [Function](#function). Sources and Sinks are special
Operators for data
+An operator is a node of a [logical graph](#logical-graph). An operator
performs a certain operation,
+which is usually executed by a [function](#function). Sources and sinks are
special operators for data
ingestion and data egress.
#### Operator Chain
-An Operator Chain consists of two or more consecutive [Operators](#operator)
without any
-repartitioning in between. Operators within the same Operator Chain forward
records to each other
-directly without going through serialization or Flink's network stack.
+An operator chain consists of two or more consecutive [operators](#operator)
without any
+repartitioning in between. Operators within the same operator chain forward
records to each other
+directly without going through serialization or Flink's network stack. This is
a useful optimization
+and increases overall throughput while decreasing latency. The chaining
behavior can be configured.
+
+#### Operator State
+
+See [non-keyed state](#non-keyed-state).
+
+#### Parallelism
+
+This is a technique for making programs run faster by performing several
computations simultaneously.
#### Partition
-A partition is an independent subset of the overall data stream or data set. A
data stream or
-data set is divided into partitions by assigning each [record](#Record) to one
or more partitions.
-Partitions of data streams or data sets are consumed by [Tasks](#task) during
runtime. A
-transformation which changes the way a data stream or data set is partitioned
is often called
-repartitioning.
+A partition is an independent subset of the overall [DataStream](#datastream).
A DataStream is divided
+into partitions by assigning each [record](#record) to one or more partitions
via keys. Partitions of
+DataStreams are consumed by [tasks](#task) during runtime. A transformation
that changes the way a
+DataStream is partitioned is often called repartitioning.
#### Physical Graph
-A physical graph is the result of translating a [Logical
Graph](#logical-graph) for execution in a
-distributed runtime. The nodes are [Tasks](#task) and the edges indicate
input/output-relationships
-or [partitions](#partition) of data streams or data sets.
+A physical graph is the result of translating a [logical
graph](#logical-graph) for execution in a
+distributed runtime. The nodes are [tasks](#task) and the edges indicate
input/output relationships
+or [partitions](#partition) of DataStreams.
+
+#### POJO
+
+This is a composite data type and can be serialized with Flink's serializer.
Flink recognizes a data
+type as a POJO type (and allows “by-name” field referencing) if the following
conditions are met:
+
+- the class is public and standalone (no non-static inner class)
+- the class has a public no-argument constructor
+- all non-static, non-transient fields in the class (and all superclasses) are
either public (and
+ non-final) or have public getter- and setter- methods that follow the Java
naming conventions for
+ getters and setters
+
+Flink analyzes the structure of POJO types and can process POJOs more
efficiently than general types.
+
+#### Process Functions
+
+This type of function combines event processing with timers and state and is
the basis for creating
+event-driven applications with Flink.
+
+#### Processing Time
+
+The time when a specific operator in your pipeline is processing the event.
Computing analytics based
+on processing time can cause inconsistencies and make it difficult to
re-analyze historic data or test
+new implementations.
+
+#### Queryable State
+
+This is managed keyed (partitioned) state that can be accessed from outside of
Flink during runtime.
+
+#### Raw State
+
+This is state that operators keep in their own data structures. When
checkpointed, only a sequence of
+bytes is written into the checkpoint and Flink knows nothing about the state’s
data structures and will
+see only the raw bytes.
+
+[Keyed state](#keyed-state) and [operator state](#operator-state) exist in two
forms: [managed](#managed-state) and raw.
#### Record
-Records are the constituent elements of a data set or data stream.
[Operators](#operator) and
-[Functions](#Function) receive records as input and emit records as output.
+Records are the elements that make up a [DataStream](#datastream).
[Operators](#operator) and [functions](#function)
+receive records as input and emit records as output.
+
+#### ResourceManager
+
+This is part of the [JobManager](#JobManager) and is responsible for resource
de-/allocation and
+provisioning in a Flink cluster.
+
+#### Rich Functions
+
+A RichFunction is a "rich" variant of Flink's function interfaces for data
transformation. These functions
+have some additional methods needed for working with managed keyed state such
as `open(Configuration c)`,
+`close()`, `getRuntimeContext()`.
Review comment:
This is very DataStream-specific; I would not have it in the glossary.
##########
File path: docs/content/docs/concepts/glossary.md
##########
@@ -25,182 +25,605 @@ under the License.
# Glossary
+#### Aggregation
+
+Aggregation is an operation that takes multiple values and returns a single
value. When working with
+streams, it generally makes more sense to think in terms of aggregations over
finite windows, rather
+than over the entire stream.
+
+#### (Flink) Application
+
+A Flink application is any user program that submits one or multiple [Flink
Jobs](#flink-job) from its
+`main()` method. The execution of these jobs can happen in a local JVM or on a
remote setup of clusters
+with multiple machines.
+
+The jobs of an application can either be submitted to a long-running [Session
Cluster](#session-cluster),
+to a dedicated [Application Cluster](#application-cluster), or to a [Job
Cluster](#job-cluster).
+
+#### Application Cluster
+
+A Flink application cluster is a dedicated [Flink cluster](#(flink)-cluster)
that only executes
+[Flink jobs](#flink-job) from one [Flink application](#(flink)-application).
The lifetime of the Flink
+cluster is bound to the lifetime of the Flink application.
+
+#### Asynchronous Snapshotting
+
+A form of [snapshotting](#snapshot) that doesn't impede the ongoing stream
processing by allowing an
+operator to continue processing while it stores its state snapshot,
effectively letting the state
+snapshots happen asynchronously in the background.
+
+#### At-least-once
+
+A fault-tolerance guarantee and data delivery approach where multiple attempts
are made at delivering
+an event such that at least one succeeds. This guarantees that nothing is
lost, but you may experience
+duplicated results.
+
+#### At-most-once
+
+A data delivery approach where each event is delivered zero or one times.
There is lower latency but
+events may be lost.
+
+#### Backpressure
+
+A situation where a system is receiving data at a higher rate than it can
process during a temporary
+load spike.
+
+#### Barrier Alignment
+
+For providing exactly-once guarantees, Flink aligns the streams at operators
that receive multiple
+input streams, so that the snapshot will reflect the state resulting from
consuming events from both
+input streams up to (but not past) both barriers.
+
+#### Batch Processing
+
+This is the processing and analysis on a set of data that have already been
stored over a period
+of time (i.e. in groups or batches). The results are usually not available in
real-time. Flink
+executes batch programs as a special case of streaming programs.
+
+#### Bounded Streams
+
+Bounded [DataStreams](#datastream) have a defined start and end. They can be
processed by ingesting
+all data before performing any computations. Ordered ingestion is not required
to process bounded streams
+because a bounded data set can always be sorted. Processing of bounded streams
is also known as
+[batch processing](#batch-processing).
+
+#### Checkpoint
+
+A [snapshot](#snapshot) taken automatically by Flink for the purpose of being
able to recover from
+faults. A checkpoint marks a specific point in each of the input streams along
with the corresponding
+state for each of the operators. Checkpoints can be incremental and unaligned,
and are optimized for
+being restored quickly.
+
+#### Checkpoint Barrier
+
+A special marker that flows along the graph and triggers the checkpointing
process on each of the
+parallel instances of the operators. Checkpoint barriers are injected into the
source operators and
+flow together with the data. If an operator has multiple outputs, it gets
"split" into both of them.
+
+#### Checkpoint Coordinator
+
+This coordinates the distributed snapshots of operators and state. It is part
of the JobManager and
+instructs the TaskManager when to begin a checkpoint by sending the messages
to the relevant tasks
+and collecting the checkpoint acknowledgements.
+
#### Checkpoint Storage
-The location where the [State Backend](#state-backend) will store its snapshot
during a checkpoint (Java Heap of [JobManager](#flink-jobmanager) or
Filesystem).
+The location where the [state backend](#state-backend) will store its snapshot
during a checkpoint.
+This could be on the Java heap of the [JobManager](#flink-jobmanager) or on a
file system.
+
+#### (Flink) Client
+
+This is not part of the runtime and program execution but is used to prepare
and send a dataflow graph
+to the JobManager. The Flink client runs either as part of the program that
triggers the execution or
+in the command line process via `./bin/flink run`.
+
+#### (Flink) Cluster
-#### Flink Application Cluster
+A distributed system consisting of (typically) one [JobManager](#jobmanager)
and one or more
+[TaskManager](#taskmanager) processes.
-A Flink Application Cluster is a dedicated [Flink Cluster](#flink-cluster) that
-only executes [Flink Jobs](#flink-job) from one [Flink
-Application](#flink-application). The lifetime of the [Flink
-Cluster](#flink-cluster) is bound to the lifetime of the Flink Application.
+#### Connected Streams
-#### Flink Job Cluster
+A pattern in Flink where a single operator has two input streams. Connected
streams can also be used
+to implement streaming joins.
-A Flink Job Cluster is a dedicated [Flink Cluster](#flink-cluster) that only
-executes a single [Flink Job](#flink-job). The lifetime of the
-[Flink Cluster](#flink-cluster) is bound to the lifetime of the Flink Job.
-This deployment mode has been deprecated since Flink 1.15.
+#### Connectors
-#### Flink Cluster
+Connectors allow [Flink applications](#(flink)-applications) to read from and
write to various external
+systems. They support multiple formats in order to encode and decode data to
match Flink’s data structures.
-A distributed system consisting of (typically) one
[JobManager](#flink-jobmanager) and one or more
-[Flink TaskManager](#flink-taskmanager) processes.
+#### Dataflow
+
+See [logical graph](#logical-graph).
+
+#### DataStream
+
+This is a collection of data in a Flink application. You can think of it as
an immutable collection
+of data that can contain duplicates. This data can be either finite or
unbounded.
+
+#### Directed Acyclic Graph (DAG)
+
+This is a directed graph that contains no cycles: following its edges never leads back
to a node you have already visited. It can be used to
+conceptually represent a [dataflow](#dataflow) where you never look back to
previous events.
+
+#### Dispatcher
+
+This is a component of the [JobManager](#jobmanager) and provides a REST
interface to submit Flink
+applications for execution and starts a new [JobMaster](#jobmaster) for each
submitted job. It also
+runs the Flink web UI to provide information about job executions.
#### Event
-An event is a statement about a change of the state of the domain modelled by
the
-application. Events can be input and/or output of a stream or batch processing
application.
-Events are special types of [records](#Record).
+An event is a statement about a change of the state of the domain modelled by
the application. Events
+can be input and/or output of a stream processing application. Events are
special types of
+[records](#Record).
+
+#### Event Time
+
+The time when an [event](#event) occurred, as recorded by the device producing
(or storing) the event.
+For reproducible results, you should use event time because the result does
not depend on when the
+calculation is performed.
+
+If you want to use event time, you will also need to supply a Timestamp
Extractor and Watermark Generator
+that Flink will use to track the progress of event time.
+
+#### Exactly-once
+
+A fault-tolerance guarantee and data delivery approach where nothing is lost
or duplicated. This does
+not mean that every event will be processed exactly once. Instead, it means
that every event will affect
+the state being managed by Flink exactly once.
#### ExecutionGraph
-see [Physical Graph](#physical-graph)
+See [Physical Graph](#physical-graph).
+
+#### Externalized Checkpoint
-#### Function
+A checkpoint that is configured to be retained instead of being deleted when a
job is cancelled.
+Flink normally retains only the n-most-recent checkpoints (n being
configurable) while a job is running
+and deletes them when a job is cancelled.
-Functions are implemented by the user and encapsulate the
-application logic of a Flink program. Most Functions are wrapped by a
corresponding
-[Operator](#operator).
+You can manually resume from an externalized checkpoint.
-#### Instance
+#### Format
-The term *instance* is used to describe a specific instance of a specific type
(usually
-[Operator](#operator) or [Function](#function)) during runtime. As Apache
Flink is mostly written in
-Java, this corresponds to the definition of *Instance* or *Object* in Java. In
the context of Apache
-Flink, the term *parallel instance* is also frequently used to emphasize that
multiple instances of
-the same [Operator](#operator) or [Function](#function) type are running in
parallel.
+A table format is a storage format that defines how to map binary data onto
table columns.
+Flink comes with a variety of built-in output formats that can be used with
table [connectors](#connector).
-#### Flink Application
+#### Ingestion Time
-A Flink application is a Java Application that submits one or multiple [Flink
-Jobs](#flink-job) from the `main()` method (or by some other means). Submitting
-jobs is usually done by calling `execute()` on an execution environment.
+A timestamp recorded by Flink at the moment it ingests the event.
-The jobs of an application can either be submitted to a long running [Flink
-Session Cluster](#flink-session-cluster), to a dedicated [Flink Application
-Cluster](#flink-application-cluster), or to a [Flink Job
-Cluster](#flink-job-cluster).
+#### (Flink) Job
-#### Flink Job
+This is the runtime representation of a [logical graph](#logical-graph) (also
often called dataflow
+graph) that is created and submitted by calling `execute()` in a [Flink
application](#flink-application).
-A Flink Job is the runtime representation of a [logical graph](#logical-graph)
-(also often called dataflow graph) that is created and submitted by calling
-`execute()` in a [Flink Application](#flink-application).
+#### Job Cluster
+
+This is a dedicated [Flink cluster](#(flink)-cluster) that only executes a
single [Flink job](#(flink)-job).
+The lifetime of the Flink cluster is bound to the lifetime of the Flink job.
This deployment mode has
+been deprecated since Flink 1.15.
#### JobGraph
-see [Logical Graph](#logical-graph)
+See [Logical Graph](#logical-graph).
+
+#### JobManager
-#### Flink JobManager
+The JobManager is the orchestrator of a [Flink cluster](#(flink)-cluster). It
contains three distinct
+components: ResourceManager, Dispatcher, and a [JobMaster](#jobmaster) per
running [Flink job](#(flink)-job).
-The JobManager is the orchestrator of a [Flink Cluster](#flink-cluster). It
contains three distinct
-components: Flink Resource Manager, Flink Dispatcher and one [Flink
JobMaster](#flink-jobmaster)
-per running [Flink Job](#flink-job).
+There is always at least one JobManager. A high-availability setup might have
multiple JobManagers,
+one of which is always the leader.
-#### Flink JobMaster
+#### JobMaster
-JobMasters are one of the components running in the
[JobManager](#flink-jobmanager). A JobMaster is
-responsible for supervising the execution of the [Tasks](#task) of a single
job.
+This is one of the components that run in the [JobManager](#jobmanager). It is
responsible for supervising
+the execution of the [tasks](#task) of a single [job](#(flink)-job). Multiple
jobs can run simultaneously
+in a [Flink cluster](#(flink)-cluster), each having its own JobMaster.
#### JobResultStore
-The JobResultStore is a Flink component that persists the results of globally
terminated
-(i.e. finished, cancelled or failed) jobs to a filesystem, allowing the
results to outlive
-a finished job. These results are then used by Flink to determine whether jobs
should
-be subject to recovery in highly-available clusters.
+The JobResultStore is a Flink component that persists the results of globally
terminated (i.e. finished,
+cancelled or failed) jobs to a filesystem, allowing the results to outlive a
finished job. These results
+are then used by Flink to determine whether jobs should be subject to recovery
in highly-available clusters.
+
+#### Key Group
+
+A key group is the atomic unit by which Flink can redistribute [keyed
state](#keyed-state). There are
+exactly as many key groups as the defined maximum parallelism. During
execution, each parallel instance
+of a keyed operator works with the keys for one or more key groups.
+
+#### Keyed State
+
+Keyed state is one of the two basic types of state in Apache Flink (the other
being operator state).
+In order to have all events with the same value of an attribute grouped
together, you can partition
+a stream around that attribute, and maintain it as an embedded key/value
store. This results in a keyed
+state.
+
+A keyed state is always bound to keys and is only available to functions and
operators that process
+data from a keyed stream.
+
+Flink supports several different types of keyed state, with the simplest one
being [ValueState](#valuestate).
+
+#### Keyed Stream
+
+A keyed stream is a [DataStream](#DataStream) on which [operator
state](#operator-state) is partitioned
+by a key. Typical operations supported by a DataStream are also possible on a
keyed stream, except for
+partitioning methods such as shuffle, forward, and keyBy.
+
+#### Lateness
+
+Lateness is defined relative to the [watermarks](#watermark). A watermark(t)
asserts that the stream
+is complete up through time t. Any event that follows this watermark and
whose timestamp is ≤ t
+is considered late.
+
+#### ListState<T>
+
+This is a type of [keyed state](#keyed-state) that keeps a list of elements.
You can append elements
+and retrieve an Iterable over all currently stored elements. Elements are
added using add(T) or
+addAll(List<T>). The Iterable can be retrieved using Iterable<T> get().
#### Logical Graph
-A logical graph is a directed graph where the nodes are [Operators](#operator)
-and the edges define input/output-relationships of the operators and correspond
-to data streams or data sets. A logical graph is created by submitting jobs
-from a [Flink Application](#flink-application).
+This is a directed graph where the nodes are [operators](#operator) and the
edges define input/output
+relationships of the operators and correspond to [DataStreams](#datastreams).
A logical graph is created
+by submitting jobs to a [Flink cluster](#(flink)-cluster) from a [Flink
application](#(flink)-application).
-Logical graphs are also often referred to as *dataflow graphs*.
+Logical graphs are also often referred to as [dataflow](#dataflow).
#### Managed State
-Managed State describes application state which has been registered with the
framework. For
-Managed State, Apache Flink will take care about persistence and rescaling
among other things.
+Managed state is application state which has been registered with the stream
processing framework,
+which will take care of the persistence and rescaling of this state.
+
+This type of state is represented in data structures controlled by the Flink
runtime, such as internal
+hash tables, or RocksDB. Flink’s runtime encodes the states and writes them
into the checkpoints.
+
+[Keyed state](#keyed-state) and [operator state](#operator-state) exist in two
forms: managed and [raw](#raw-state).
+
+#### MapState<UK, UV>
+
+This is a type of [keyed state](#keyed-state) that keeps a list of mappings.
You can put key-value
+pairs into the state and retrieve an Iterable over all currently stored
mappings. Mappings are added
+using put(UK, UV) or putAll(Map<UK, UV>). The value associated with a key can
be retrieved using get(UK).
+
+#### Non-keyed State
+
+This type of state is bound to one parallel operator instance and is also
called [operator state](#operator-state).
+
+It is possible to work with [managed state](#managed-state) in non-keyed
contexts but it is unusual
+for user-defined functions to need non-keyed state and the interfaces involved
would be different.
+
+This feature is most often used in the implementation of [sources](#source)
and [sinks](#sink).
+
+#### Offset
+
+A number identifying how far you are from the beginning of a certain
[DataStream](#datastream).
#### Operator
-Node of a [Logical Graph](#logical-graph). An Operator performs a certain
operation, which is
-usually executed by a [Function](#function). Sources and Sinks are special
Operators for data
+An operator is a node of a [logical graph](#logical-graph). An operator
performs a certain operation,
+which is usually executed by a [function](#function). Sources and sinks are
special operators for data
ingestion and data egress.
#### Operator Chain
-An Operator Chain consists of two or more consecutive [Operators](#operator)
without any
-repartitioning in between. Operators within the same Operator Chain forward
records to each other
-directly without going through serialization or Flink's network stack.
+An operator chain consists of two or more consecutive [operators](#operator)
without any
+repartitioning in between. Operators within the same operator chain forward
records to each other
+directly without going through serialization or Flink's network stack. This is
a useful optimization
+and increases overall throughput while decreasing latency. The chaining
behavior can be configured.
+
+#### Operator State
+
+See [non-keyed state](#non-keyed-state).
+
+#### Parallelism
+
+This is a technique for making programs run faster by performing several
computations simultaneously.
#### Partition
-A partition is an independent subset of the overall data stream or data set. A
data stream or
-data set is divided into partitions by assigning each [record](#Record) to one
or more partitions.
-Partitions of data streams or data sets are consumed by [Tasks](#task) during
runtime. A
-transformation which changes the way a data stream or data set is partitioned
is often called
-repartitioning.
+A partition is an independent subset of the overall [DataStream](#datastream).
A DataStream is divided
+into partitions by assigning each [record](#record) to one or more partitions
via keys. Partitions of
+DataStreams are consumed by [tasks](#task) during runtime. A transformation
that changes the way a
+DataStream is partitioned is often called repartitioning.
#### Physical Graph
-A physical graph is the result of translating a [Logical
Graph](#logical-graph) for execution in a
-distributed runtime. The nodes are [Tasks](#task) and the edges indicate
input/output-relationships
-or [partitions](#partition) of data streams or data sets.
+A physical graph is the result of translating a [logical
graph](#logical-graph) for execution in a
+distributed runtime. The nodes are [tasks](#task) and the edges indicate
input/output relationships
+or [partitions](#partition) of DataStreams.
+
+#### POJO
+
+This is a composite data type and can be serialized with Flink's serializer.
Flink recognizes a data
+type as a POJO type (and allows “by-name” field referencing) if the following
conditions are met:
+
+- the class is public and standalone (no non-static inner class)
+- the class has a public no-argument constructor
+- all non-static, non-transient fields in the class (and all superclasses) are
either public (and
+ non-final) or have public getter- and setter- methods that follow the Java
naming conventions for
+ getters and setters
+
+Flink analyzes the structure of POJO types and can process POJOs more
efficiently than general types.
+
+#### Process Functions
+
+This type of function combines event processing with timers and state and is
the basis for creating
+event-driven applications with Flink.
+
+#### Processing Time
+
+The time when a specific operator in your pipeline is processing the event.
Computing analytics based
+on processing time can cause inconsistencies and make it difficult to
re-analyze historic data or test
+new implementations.
+
+#### Queryable State
+
+This is managed keyed (partitioned) state that can be accessed from outside of
Flink during runtime.
+
+#### Raw State
+
+This is state that operators keep in their own data structures. When
checkpointed, only a sequence of
+bytes is written into the checkpoint and Flink knows nothing about the state’s
data structures and will
+see only the raw bytes.
+
+[Keyed state](#keyed-state) and [operator state](#operator-state) exist in two
forms: [managed](#managed-state) and raw.
#### Record
-Records are the constituent elements of a data set or data stream.
[Operators](#operator) and
-[Functions](#Function) receive records as input and emit records as output.
+Records are the elements that make up a [DataStream](#datastream).
[Operators](#operator) and [functions](#function)
+receive records as input and emit records as output.
+
+#### ResourceManager
+
+This is part of the [JobManager](#JobManager) and is responsible for resource
de-/allocation and
+provisioning in a Flink cluster.
+
+#### Rich Functions
+
+A RichFunction is a "rich" variant of Flink's function interfaces for data
transformation. These functions
+have some additional methods needed for working with managed keyed state such
as `open(Configuration c)`,
+`close()`, `getRuntimeContext()`.
+
+#### Rolling Total
+
+The sum of a sequence of numbers which is updated each time a new number is
added to the sequence,
+by adding the value of the new number to the previous rolling total.
#### (Runtime) Execution Mode
-DataStream API programs can be executed in one of two execution modes: `BATCH`
-or `STREAMING`. See [Execution Mode]({{< ref
"/docs/dev/datastream/execution_mode" >}}) for more details.
+DataStream API programs can be executed in one of two execution modes: `BATCH`
or `STREAMING`.
+See [Execution Mode]({{< ref "/docs/dev/datastream/execution_mode" >}}) for
more details.
+
+#### Savepoint
+
+A [snapshot](#snapshot) triggered manually by a user (or an API call) for some
operational purpose,
+such as a stateful redeploy/upgrade/rescaling. Savepoints are always complete
and aligned and are
+optimized for operational flexibility.
+
+#### Scalar
+
+A scalar refers to a single value. This is in contrast to a set of values.
+
+#### Schema
+
+This refers to the organization or structure of data as a blueprint.
+
+#### Serialization
+
+This is the process of turning a data element in memory into a stream of bytes
so that you can more
+efficiently store it on disk or send it over the network.
Review comment:
Remove _more efficiently_. Without serialization, you can't write to disk or
send data over the network at all.
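To illustrate the point, here is a minimal sketch in plain Java. It does not use Flink's serializer stack; `SerializationSketch` and `Event` are hypothetical names chosen only for this example. The in-memory object has to be flattened into bytes before it can be written anywhere, and it is rebuilt from those bytes on the other side:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SerializationSketch {

    /** Hypothetical in-memory data element; this is not a Flink type. */
    public static final class Event {
        public final String key;
        public final long timestamp;

        public Event(String key, long timestamp) {
            this.key = key;
            this.timestamp = timestamp;
        }
    }

    /** Turns the in-memory object into a flat sequence of bytes. */
    public static byte[] serialize(Event event) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(event.key);        // 2-byte length prefix + encoded characters
            out.writeLong(event.timestamp); // 8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Rebuilds the object from the bytes, reading fields in the order they were written. */
    public static Event deserialize(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String key = in.readUTF();
            long timestamp = in.readLong();
            return new Event(key, timestamp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Event original = new Event("sensor-1", 42L);
        byte[] bytes = serialize(original); // only these bytes could go to disk or a socket
        Event copy = deserialize(bytes);
        System.out.println(bytes.length + " bytes, key=" + copy.key + ", timestamp=" + copy.timestamp);
    }
}
```

The byte array is the only thing a file or a socket can carry, so serialization is a prerequisite for storing or sending the element, not an optimization.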
##########
File path: docs/content/docs/concepts/glossary.md
##########
@@ -25,182 +25,605 @@ under the License.
# Glossary
+#### Aggregation
+
+Aggregation is an operation that takes multiple values and returns a single
value. When working with
+streams, it generally makes more sense to think in terms of aggregations over
finite windows, rather
+than over the entire stream.
+
+#### (Flink) Application
+
+A Flink application is any user program that submits one or multiple [Flink
Jobs](#flink-job) from its
+`main()` method. The execution of these jobs can happen in a local JVM or on a
remote setup of clusters
+with multiple machines.
+
+The jobs of an application can either be submitted to a long-running [Session
Cluster](#session-cluster),
+to a dedicated [Application Cluster](#application-cluster), or to a [Job
Cluster](#job-cluster).
+
+#### Application Cluster
+
+A Flink application cluster is a dedicated [Flink cluster](#(flink)-cluster)
that only executes
+[Flink jobs](#flink-job) from one [Flink application](#(flink)-application).
The lifetime of the Flink
+cluster is bound to the lifetime of the Flink application.
+
+#### Asynchronous Snapshotting
+
+A form of [snapshotting](#snapshot) that doesn't impede the ongoing stream
processing by allowing an
+operator to continue processing while it stores its state snapshot,
effectively letting the state
+snapshots happen asynchronously in the background.
+
+#### At-least-once
+
+A fault-tolerance guarantee and data delivery approach where multiple attempts
are made at delivering
+an event such that at least one succeeds. This guarantees that nothing is
lost, but you may experience
+duplicated results.
+
+#### At-most-once
+
+A data delivery approach where each event is delivered zero or one times.
There is lower latency but
+events may be lost.
+
+#### Backpressure
+
+A situation where a system is receiving data at a higher rate than it can
process during a temporary
+load spike.
+
+#### Barrier Alignment
+
+For providing exactly-once guarantees, Flink aligns the streams at operators
that receive multiple
+input streams, so that the snapshot will reflect the state resulting from
consuming events from both
+input streams up to (but not past) both barriers.
+
+#### Batch Processing
+
+This is the processing and analysis of a set of data that has already been
stored over a period
+of time (i.e. in groups or batches). The results are usually not available in
real-time. Flink
+executes batch programs as a special case of streaming programs.
+
+#### Bounded Streams
+
+Bounded [DataStreams](#datastream) have a defined start and end. They can be
processed by ingesting
+all data before performing any computations. Ordered ingestion is not required
to process bounded streams
+because a bounded data set can always be sorted. Processing of bounded streams
is also known as
+[batch processing](#batch-processing).
+
+#### Checkpoint
+
+A [snapshot](#snapshot) taken automatically by Flink for the purpose of being
able to recover from
+faults. A checkpoint marks a specific point in each of the input streams along
with the corresponding
+state for each of the operators. Checkpoints can be incremental and unaligned,
and are optimized for
+being restored quickly.
+
+#### Checkpoint Barrier
+
+A special marker that flows along the graph and triggers the checkpointing
process on each of the
+parallel instances of the operators. Checkpoint barriers are injected into the
source operators and
+flow together with the data. If an operator has multiple outputs, the barrier gets
"split" and forwarded to each of them.
+
+#### Checkpoint Coordinator
+
+This coordinates the distributed snapshots of operators and state. It is part
of the JobManager and
+instructs the TaskManagers when to begin a checkpoint by sending messages
to the relevant tasks
+and collecting the checkpoint acknowledgements.
+
#### Checkpoint Storage
-The location where the [State Backend](#state-backend) will store its snapshot
during a checkpoint (Java Heap of [JobManager](#flink-jobmanager) or
Filesystem).
+The location where the [state backend](#state-backend) will store its snapshot
during a checkpoint.
+This could be on the Java heap of the [JobManager](#flink-jobmanager) or on a
file system.
+
+#### (Flink) Client
+
+This is not part of the runtime and program execution but is used to prepare
and send a dataflow graph
+to the JobManager. The Flink client runs either as part of the program that
triggers the execution or
+in the command line process via `./bin/flink run`.
+
+#### (Flink) Cluster
-#### Flink Application Cluster
+A distributed system consisting of (typically) one [JobManager](#jobmanager)
and one or more
+[TaskManager](#taskmanager) processes.
-A Flink Application Cluster is a dedicated [Flink Cluster](#flink-cluster) that
-only executes [Flink Jobs](#flink-job) from one [Flink
-Application](#flink-application). The lifetime of the [Flink
-Cluster](#flink-cluster) is bound to the lifetime of the Flink Application.
+#### Connected Streams
-#### Flink Job Cluster
+A pattern in Flink where a single operator has two input streams. Connected
streams can also be used
+to implement streaming joins.
-A Flink Job Cluster is a dedicated [Flink Cluster](#flink-cluster) that only
-executes a single [Flink Job](#flink-job). The lifetime of the
-[Flink Cluster](#flink-cluster) is bound to the lifetime of the Flink Job.
-This deployment mode has been deprecated since Flink 1.15.
+#### Connectors
-#### Flink Cluster
+Connectors allow [Flink applications](#(flink)-applications) to read from and
write to various external
+systems. They support multiple formats in order to encode and decode data to
match Flink’s data structures.
-A distributed system consisting of (typically) one
[JobManager](#flink-jobmanager) and one or more
-[Flink TaskManager](#flink-taskmanager) processes.
+#### Dataflow
+
+See [logical graph](#logical-graph).
+
+#### DataStream
+
+This is a collection of data in a Flink application. You can think of it as
an immutable collection
+of data that can contain duplicates. This data can be either finite or
unbounded.
+
+#### Directed Acyclic Graph (DAG)
+
+This is a directed graph that contains no cycles: following its edges never leads back
to a node you have already visited. It can be used to
+conceptually represent a [dataflow](#dataflow) where you never look back to
previous events.
+
+#### Dispatcher
+
+This is a component of the [JobManager](#jobmanager) and provides a REST
interface to submit Flink
+applications for execution and starts a new [JobMaster](#jobmaster) for each
submitted job. It also
+runs the Flink web UI to provide information about job executions.
#### Event
-An event is a statement about a change of the state of the domain modelled by
the
-application. Events can be input and/or output of a stream or batch processing
application.
-Events are special types of [records](#Record).
+An event is a statement about a change of the state of the domain modelled by
the application. Events
+can be input and/or output of a stream processing application. Events are
special types of
+[records](#Record).
+
+#### Event Time
+
+The time when an [event](#event) occurred, as recorded by the device producing
(or storing) the event.
+For reproducible results, you should use event time because the result does
not depend on when the
+calculation is performed.
+
+If you want to use event time, you will also need to supply a Timestamp
Extractor and Watermark Generator
+that Flink will use to track the progress of event time.
+
+#### Exactly-once
+
+A fault-tolerance guarantee and data delivery approach where nothing is lost
or duplicated. This does
+not mean that every event will be processed exactly once. Instead, it means
that every event will affect
+the state being managed by Flink exactly once.
#### ExecutionGraph
-see [Physical Graph](#physical-graph)
+See [Physical Graph](#physical-graph).
+
+#### Externalized Checkpoint
-#### Function
+A checkpoint that is configured to be retained instead of being deleted when a
job is cancelled.
+Flink normally retains only the n-most-recent checkpoints (n being
configurable) while a job is running
+and deletes them when a job is cancelled.
-Functions are implemented by the user and encapsulate the
-application logic of a Flink program. Most Functions are wrapped by a
corresponding
-[Operator](#operator).
+You can manually resume from an externalized checkpoint.
-#### Instance
+#### Format
-The term *instance* is used to describe a specific instance of a specific type
(usually
-[Operator](#operator) or [Function](#function)) during runtime. As Apache
Flink is mostly written in
-Java, this corresponds to the definition of *Instance* or *Object* in Java. In
the context of Apache
-Flink, the term *parallel instance* is also frequently used to emphasize that
multiple instances of
-the same [Operator](#operator) or [Function](#function) type are running in
parallel.
+A table format is a storage format that defines how to map binary data onto
table columns.
+Flink comes with a variety of built-in output formats that can be used with
table [connectors](#connector).
-#### Flink Application
+#### Ingestion Time
-A Flink application is a Java Application that submits one or multiple [Flink
-Jobs](#flink-job) from the `main()` method (or by some other means). Submitting
-jobs is usually done by calling `execute()` on an execution environment.
+A timestamp recorded by Flink at the moment it ingests the event.
-The jobs of an application can either be submitted to a long running [Flink
-Session Cluster](#flink-session-cluster), to a dedicated [Flink Application
-Cluster](#flink-application-cluster), or to a [Flink Job
-Cluster](#flink-job-cluster).
+#### (Flink) Job
-#### Flink Job
+This is the runtime representation of a [logical graph](#logical-graph) (also
often called dataflow
+graph) that is created and submitted by calling `execute()` in a [Flink
application](#flink-application).
-A Flink Job is the runtime representation of a [logical graph](#logical-graph)
-(also often called dataflow graph) that is created and submitted by calling
-`execute()` in a [Flink Application](#flink-application).
+#### Job Cluster
+
+This is a dedicated [Flink cluster](#(flink)-cluster) that only executes a
single [Flink job](#(flink)-job).
+The lifetime of the Flink cluster is bound to the lifetime of the Flink job.
This deployment mode has
+been deprecated since Flink 1.15.
#### JobGraph
-see [Logical Graph](#logical-graph)
+See [Logical Graph](#logical-graph).
+
+#### JobManager
-#### Flink JobManager
+The JobManager is the orchestrator of a [Flink cluster](#(flink)-cluster). It
contains three distinct
+components: ResourceManager, Dispatcher, and a [JobMaster](#jobmaster) per
running [Flink job](#(flink)-job).
-The JobManager is the orchestrator of a [Flink Cluster](#flink-cluster). It
contains three distinct
-components: Flink Resource Manager, Flink Dispatcher and one [Flink
JobMaster](#flink-jobmaster)
-per running [Flink Job](#flink-job).
+There is always at least one JobManager. A high-availability setup might have
multiple JobManagers,
+one of which is always the leader.
-#### Flink JobMaster
+#### JobMaster
-JobMasters are one of the components running in the
[JobManager](#flink-jobmanager). A JobMaster is
-responsible for supervising the execution of the [Tasks](#task) of a single
job.
+This is one of the components that run in the [JobManager](#jobmanager). It is
responsible for supervising
+the execution of the [tasks](#task) of a single [job](#(flink)-job). Multiple
jobs can run simultaneously
+in a [Flink cluster](#(flink)-cluster), each having its own JobMaster.
#### JobResultStore
-The JobResultStore is a Flink component that persists the results of globally
terminated
-(i.e. finished, cancelled or failed) jobs to a filesystem, allowing the
results to outlive
-a finished job. These results are then used by Flink to determine whether jobs
should
-be subject to recovery in highly-available clusters.
+The JobResultStore is a Flink component that persists the results of globally
terminated (i.e. finished,
+cancelled or failed) jobs to a filesystem, allowing the results to outlive a
finished job. These results
+are then used by Flink to determine whether jobs should be subject to recovery
in highly-available clusters.
+
+#### Key Group
+
+A key group is the atomic unit by which Flink can redistribute [keyed
state](#keyed-state). There are
+exactly as many key groups as the defined maximum parallelism. During
execution, each parallel instance
+of a keyed operator works with the keys for one or more key groups.
+
+#### Keyed State
+
+Keyed state is one of the two basic types of state in Apache Flink (the other
being operator state).
+In order to have all events with the same value of an attribute grouped
together, you can partition
+a stream around that attribute, and maintain it as an embedded key/value
store. This results in a keyed
+state.
+
+A keyed state is always bound to keys and is only available to functions and
operators that process
+data from a keyed stream.
+
+Flink supports several different types of keyed state, with the simplest one
being [ValueState](#valuestate).
+
+#### Keyed Stream
+
+A keyed stream is a [DataStream](#DataStream) on which [operator
state](#operator-state) is partitioned
+by a key. Typical operations supported by a DataStream are also possible on a
keyed stream, except for
+partitioning methods such as shuffle, forward, and keyBy.
+
+#### Lateness
+
+Lateness is defined relative to the [watermarks](#watermark). A watermark(t)
asserts that the stream
+is complete up through time t. Any event that follows this watermark and
whose timestamp is ≤ t
+is considered late.
+
+#### ListState<T>
+
+This is a type of [keyed state](#keyed-state) that keeps a list of elements. You can append elements
+and retrieve an Iterable over all currently stored elements. Elements are added using add(T) or
+addAll(List<T>). The Iterable can be retrieved using Iterable<T> get().
#### Logical Graph
-A logical graph is a directed graph where the nodes are [Operators](#operator)
-and the edges define input/output-relationships of the operators and correspond
-to data streams or data sets. A logical graph is created by submitting jobs
-from a [Flink Application](#flink-application).
+This is a directed graph where the nodes are [operators](#operator) and the edges define input/output
+relationships of the operators and correspond to [DataStreams](#datastream). A logical graph is created
+by submitting jobs to a [Flink cluster](#(flink)-cluster) from a [Flink application](#(flink)-application).
-Logical graphs are also often referred to as *dataflow graphs*.
+Logical graphs are also often referred to as [dataflow graphs](#dataflow).
#### Managed State
-Managed State describes application state which has been registered with the framework. For
-Managed State, Apache Flink will take care about persistence and rescaling among other things.
+Managed state is application state which has been registered with the stream processing framework,
+which will take care of the persistence and rescaling of this state.
+
+This type of state is represented in data structures controlled by the Flink runtime, such as internal
+hash tables, or RocksDB. Flink’s runtime encodes the states and writes them into the checkpoints.
+
+[Keyed state](#keyed-state) and [operator state](#operator-state) exist in two forms: managed and [raw](#raw-state).
+
+#### MapState<UK, UV>
+
+This is a type of [keyed state](#keyed-state) that keeps a list of mappings. You can put key-value
+pairs into the state and retrieve an Iterable over all currently stored mappings. Mappings are added
+using put(UK, UV) or putAll(Map<UK, UV>). The value associated with a key can be retrieved using get(UK).
+
+#### Non-keyed State
+
+This type of state is bound to one parallel operator instance and is also called [operator state](#operator-state).
+
+It is possible to work with [managed state](#managed-state) in non-keyed contexts, but it is unusual
+for user-defined functions to need non-keyed state, and the interfaces involved are different.
+
+This feature is most often used in the implementation of [sources](#source) and [sinks](#sink).
+
+#### Offset
+
+A number identifying how far you are from the beginning of a certain [DataStream](#datastream).
#### Operator
-Node of a [Logical Graph](#logical-graph). An Operator performs a certain operation, which is
-usually executed by a [Function](#function). Sources and Sinks are special Operators for data
+An operator is a node of a [logical graph](#logical-graph). An operator performs a certain operation,
+which is usually executed by a [function](#function). Sources and sinks are special operators for data
ingestion and data egress.
#### Operator Chain
-An Operator Chain consists of two or more consecutive [Operators](#operator) without any
-repartitioning in between. Operators within the same Operator Chain forward records to each other
-directly without going through serialization or Flink's network stack.
+An operator chain consists of two or more consecutive [operators](#operator) without any
+repartitioning in between. Operators within the same operator chain forward records to each other
+directly without going through serialization or Flink's network stack. This is a useful optimization
+and increases overall throughput while decreasing latency. The chaining behavior can be configured.
+
+#### Operator State
+
+See [non-keyed state](#non-keyed-state).
+
+#### Parallelism
+
+This is a technique for making programs run faster by performing several computations simultaneously.
#### Partition
-A partition is an independent subset of the overall data stream or data set. A data stream or
-data set is divided into partitions by assigning each [record](#Record) to one or more partitions.
-Partitions of data streams or data sets are consumed by [Tasks](#task) during runtime. A
-transformation which changes the way a data stream or data set is partitioned is often called
-repartitioning.
+A partition is an independent subset of the overall [DataStream](#datastream). A DataStream is divided
+into partitions by assigning each [record](#record) to one or more partitions via keys. Partitions of
+DataStreams are consumed by [tasks](#task) during runtime. A transformation that changes the way a
+DataStream is partitioned is often called repartitioning.
#### Physical Graph
-A physical graph is the result of translating a [Logical Graph](#logical-graph) for execution in a
-distributed runtime. The nodes are [Tasks](#task) and the edges indicate input/output-relationships
-or [partitions](#partition) of data streams or data sets.
+A physical graph is the result of translating a [logical graph](#logical-graph) for execution in a
+distributed runtime. The nodes are [tasks](#task) and the edges indicate input/output relationships
+or [partitions](#partition) of DataStreams.
+
+#### POJO
+
+This is a composite data type and can be serialized with Flink's serializer. Flink recognizes a data
+type as a POJO type (and allows “by-name” field referencing) if the following conditions are met:
+
+- the class is public and standalone (no non-static inner class)
+- the class has a public no-argument constructor
+- all non-static, non-transient fields in the class (and all superclasses) are either public (and
+  non-final) or have public getter and setter methods that follow the Java naming conventions for
+  getters and setters
+
+Flink analyzes the structure of POJO types and can process POJOs more efficiently than general types.
+
+#### Process Functions
+
+This type of function combines event processing with timers and state and is the basis for creating
+event-driven applications with Flink.
+
+#### Processing Time
+
+The time when a specific operator in your pipeline is processing the event. Computing analytics based
+on processing time can cause inconsistencies and make it difficult to re-analyze historic data or test
+new implementations.
+
+#### Queryable State
+
+This is managed keyed (partitioned) state that can be accessed from outside of Flink during runtime.
+
+#### Raw State
+
+This is state that operators keep in their own data structures. When checkpointed, only a sequence of
+bytes is written into the checkpoint and Flink knows nothing about the state’s data structures and will
+see only the raw bytes.
+
+[Keyed state](#keyed-state) and [operator state](#operator-state) exist in two forms: [managed](#managed-state) and raw.
#### Record
-Records are the constituent elements of a data set or data stream. [Operators](#operator) and
-[Functions](#Function) receive records as input and emit records as output.
+Records are the elements that make up a [DataStream](#datastream). [Operators](#operator) and [functions](#function)
+receive records as input and emit records as output.
+
+#### ResourceManager
+
+This is part of the [JobManager](#jobmanager) and is responsible for resource de-/allocation and
+provisioning in a Flink cluster.
+
+#### Rich Functions
+
+A RichFunction is a "rich" variant of Flink's function interfaces for data transformation. These functions
+have some additional methods needed for working with managed keyed state, such as `open(Configuration c)`,
+`close()`, and `getRuntimeContext()`.
+
+#### Rolling Total
+
+The sum of a sequence of numbers which is updated each time a new number is added to the sequence,
+by adding the value of the new number to the previous rolling total.
#### (Runtime) Execution Mode
-DataStream API programs can be executed in one of two execution modes: `BATCH`
-or `STREAMING`. See [Execution Mode]({{< ref "/docs/dev/datastream/execution_mode" >}}) for more details.
+DataStream API programs can be executed in one of two execution modes: `BATCH` or `STREAMING`.
+See [Execution Mode]({{< ref "/docs/dev/datastream/execution_mode" >}}) for more details.
+
+#### Savepoint
+
+A [snapshot](#snapshot) triggered manually by a user (or an API call) for some operational purpose,
+such as a stateful redeploy/upgrade/rescaling. Savepoints are always complete and aligned and are
+optimized for operational flexibility.
+
+#### Scalar
+
+A scalar refers to a single value. This is in contrast to a set of values.
Review comment:
I would remove this one: it is a well-known computer science concept, and
there is nothing Flink-specific to add here.