[GitHub] [flink] MartijnVisser commented on a change in pull request #17260: [FLINK-21589][docs] Document table pipeline upgrades

GitBox Mon, 13 Sep 2021 05:35:15 -0700


MartijnVisser commented on a change in pull request #17260:
URL: https://github.com/apache/flink/pull/17260#discussion_r707279337




##########
File path: docs/content/docs/dev/table/concepts/overview.md
##########
@@ -32,6 +32,79 @@ This means that Table API and SQL queries have the same semantics regardless whe
 
 The following pages explain concepts, practical limitations, and stream-specific configuration parameters of Flink's relational APIs on streaming data.
 
+State Management
+----------------
+
+Table programs that run in streaming mode leverage all capabilities of Flink as a stateful stream
+processor.
+
+In particular, a table program can be configured with a [state backend]({{< ref "docs/ops/state/state_backends" >}})
+and various [checkpointing options]({{< ref "docs/dev/datastream/fault-tolerance/checkpointing" >}})
+for handling large amounts of state and fault tolerance. It is possible to take a savepoint of a running
+Table API & SQL pipeline and to restore the application's state at later point in time.
+
+### State Usage
+
+Due to the declarative nature of Table API & SQL program, it is not always obvious where and how much
+state is used within a table pipeline. The planner decides about when state is necessary to compute a correct
+result. A pipeline is optimized to claim as little state as possible given the current set of optimizer
+rules.
+
+{{< hint info >}}
+Source tables are never kept entirely in state. This depends on the used operations.
+{{< /hint >}}
+
+Simple `SELECT ... FROM ... WHERE` queries that only consist of field projections or filters are usually
+stateless pipelines. However, operations such as joins, aggregations, or deduplications require to keep
+intermediate results in a fault tolerant storage for which Flink's state abstractions are used.
+
+{{< hint info >}}
+Please refer to the individual operator documentation for more details about how much state is required
+and how to limit a potentially ever growing state size.
+{{< /hint >}}
+
+For example, a regular SQL join of two tables requires the operator to keep both input tables in state
+entirely. For correct SQL semantics, the runtime needs to assume that a matching could occur at any
+point in time from both sides. Flink provides [optimized window and interval joins]({{< ref "docs/dev/table/sql/queries/joins" >}})
+that aim to keep the state size small by exploiting the concept of [watermarks]({{< ref "docs/dev/table/concepts/time_attributes" >}}).
+
+### Stateful Upgrades and Evolution
+
+Table programs that are executed in streaming mode are intended as *standing queries* that statically
+define an end-to-end pipline.

Review comment:
       ```suggestion
   define an end-to-end pipeline.
   ```

##########
File path: docs/content/docs/dev/table/concepts/overview.md
##########
@@ -32,6 +32,79 @@ This means that Table API and SQL queries have the same semantics regardless whe
 
 The following pages explain concepts, practical limitations, and stream-specific configuration parameters of Flink's relational APIs on streaming data.
 
+State Management
+----------------
+
+Table programs that run in streaming mode leverage all capabilities of Flink as a stateful stream
+processor.
+
+In particular, a table program can be configured with a [state backend]({{< ref "docs/ops/state/state_backends" >}})
+and various [checkpointing options]({{< ref "docs/dev/datastream/fault-tolerance/checkpointing" >}})
+for handling large amounts of state and fault tolerance. It is possible to take a savepoint of a running
+Table API & SQL pipeline and to restore the application's state at later point in time.
+
+### State Usage
+
+Due to the declarative nature of Table API & SQL program, it is not always obvious where and how much
+state is used within a table pipeline. The planner decides about when state is necessary to compute a correct
+result. A pipeline is optimized to claim as little state as possible given the current set of optimizer
+rules.
+
+{{< hint info >}}
+Source tables are never kept entirely in state. This depends on the used operations.
+{{< /hint >}}
+
+Simple `SELECT ... FROM ... WHERE` queries that only consist of field projections or filters are usually
+stateless pipelines. However, operations such as joins, aggregations, or deduplications require to keep
+intermediate results in a fault tolerant storage for which Flink's state abstractions are used.
+
+{{< hint info >}}
+Please refer to the individual operator documentation for more details about how much state is required
+and how to limit a potentially ever growing state size.
+{{< /hint >}}
+
+For example, a regular SQL join of two tables requires the operator to keep both input tables in state
+entirely. For correct SQL semantics, the runtime needs to assume that a matching could occur at any
+point in time from both sides. Flink provides [optimized window and interval joins]({{< ref "docs/dev/table/sql/queries/joins" >}})
+that aim to keep the state size small by exploiting the concept of [watermarks]({{< ref "docs/dev/table/concepts/time_attributes" >}}).
+
+### Stateful Upgrades and Evolution
+
+Table programs that are executed in streaming mode are intended as *standing queries* that statically
+define an end-to-end pipline.
+
+In case of stateful pipelines, any change to both the query or Flink's planner might lead to a completely
+different execution plan. This makes stateful upgrades and the evolution of table programs challenging
+at the moment. The community is working on improving those shortcomings.
+
+For example, by adding a filter predicate, the optimizer might decide to reorder joins or change the
+schema of an intermediate operator. This prevents restoring from a savepoint due to either changed
+topology or different column layout within the state of an operator.
+
+The query implementer must ensure that the optimized plans before and after the change are compatible.
+Use the `EXPLAIN` command in SQL or `table.explain()` in Table API to [get insights]({{< ref "docs/dev/table/common" >}}#explaining-a-table).
+
+Since new optimizer rules are continously added, and operators become more efficient and specialized,

Review comment:
       ```suggestion
   Since new optimizer rules are continuously added, and operators become more efficient and specialized,
   ```

##########
File path: docs/content/docs/dev/table/concepts/overview.md
##########
@@ -32,6 +32,79 @@ This means that Table API and SQL queries have the same semantics regardless whe
 
 The following pages explain concepts, practical limitations, and stream-specific configuration parameters of Flink's relational APIs on streaming data.
 
+State Management
+----------------
+
+Table programs that run in streaming mode leverage all capabilities of Flink as a stateful stream
+processor.
+
+In particular, a table program can be configured with a [state backend]({{< ref "docs/ops/state/state_backends" >}})
+and various [checkpointing options]({{< ref "docs/dev/datastream/fault-tolerance/checkpointing" >}})
+for handling large amounts of state and fault tolerance. It is possible to take a savepoint of a running
+Table API & SQL pipeline and to restore the application's state at later point in time.

Review comment:
       ```suggestion
   Table API & SQL pipeline and to restore the application's state at a later point in time.
   ```




