twalthr commented on a change in pull request #15837:
URL: https://github.com/apache/flink/pull/15837#discussion_r627233143
########## File path: docs/content/docs/dev/table/data_stream_api.md ##########
@@ -0,0 +1,2106 @@
+---
+title: "DataStream API Integration"
+weight: 3
+type: docs
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# DataStream API Integration
+
+{{< hint info >}}
+This page only discusses the integration with the DataStream API in JVM languages such as Java or Scala.
+For Python, see the [Python API]({{< ref "docs/dev/python/overview" >}}) area.
+{{< /hint >}}
+
+Both the Table & SQL API and the DataStream API are equally important when it comes to defining a data
+processing pipeline.
+
+The DataStream API offers the primitives of stream processing (namely time, state, and dataflow
+management) in a rather low-level imperative programming API. The Table & SQL API abstracts away many
+internals and provides a structured and declarative API.
+
+Both APIs can work with bounded *and* unbounded streams.
+
+Bounded streams need to be managed when processing historical data. Unbounded streams occur
+in real-time processing scenarios that might be initialized with historical data first.
+
+For efficient execution, both APIs offer to process bounded streams in an optimized batch execution
+mode. However, since batch is just a special case of streaming, it is also possible to run pipelines
+of bounded streams in regular streaming execution mode.
+
+{{< hint warning >}}
+Currently, both the DataStream API and the Table API provide their own way of enabling batch
+execution mode. In the near future, this will be further unified.
+{{< /hint >}}
+
+Pipelines in one API can be defined end-to-end without dependencies on the other API. However, it
+might be useful to mix both APIs for various reasons:
+
+- Use the table ecosystem to easily access catalogs or to connect to external systems, before
+implementing the main pipeline in the DataStream API.
+- Access some of the SQL functions for stateless data normalization and cleansing, before
+implementing the main pipeline in the DataStream API.
+- Switch to the DataStream API whenever a lower-level operation (e.g. custom timer
+handling) is not available in the Table API.
+
+Flink provides special bridging functionalities to make the integration with the DataStream API as
+smooth as possible.
+
+{{< hint info >}}
+Switching between the DataStream and Table API adds some conversion overhead. For example, internal
+data structures of the table runtime (i.e. `RowData`) that partially work on binary data need to be
+converted to more user-friendly data structures (i.e. `Row`). Usually, this overhead can be neglected
+but is mentioned here for completeness.
+{{< /hint >}}
+
+{{< top >}}
+
+Converting between DataStream and Table
+---------------------------------------
+
+Flink provides a specialized `StreamTableEnvironment` in Java and Scala for integrating with the
+DataStream API. These environments extend the regular `TableEnvironment` with additional methods
+and take the `StreamExecutionEnvironment` used in the DataStream API as a parameter.
+
+{{< hint warning >}}
+Currently, the `StreamTableEnvironment` does not support batch execution mode. Use the regular
+`TableEnvironment` for this. Nevertheless, both bounded and unbounded streams can also be processed
+using streaming execution mode.
+{{< /hint >}}
+
+The following code shows an example of how to go back and forth between the two APIs. Column names
+and types of the `Table` are automatically derived from the `TypeInformation` of the `DataStream`.
+Since the DataStream API does not support changelog processing natively, the code assumes
+append-only/insert-only semantics during the stream-to-table and table-to-stream conversion.
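The diff hunk is cut off before the example that this last paragraph introduces. For orientation, here is a minimal sketch of such a round trip, assuming the Flink 1.13-style bridging API (`StreamTableEnvironment.create`, `fromDataStream`, `toDataStream`); the sample data and class name are illustrative, not necessarily the PR's exact listing:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public final class RoundTripExample {
    public static void main(String[] args) throws Exception {
        // create environments of both APIs
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // an insert-only stream; column name "f0" and type STRING are
        // derived from the stream's TypeInformation
        DataStream<String> dataStream = env.fromElements("Alice", "Bob", "John");

        // DataStream -> Table
        Table inputTable = tableEnv.fromDataStream(dataStream);

        // register the Table as a view and query it with SQL
        tableEnv.createTemporaryView("InputTable", inputTable);
        Table resultTable = tableEnv.sqlQuery("SELECT UPPER(f0) FROM InputTable");

        // Table -> DataStream (the result is still insert-only)
        DataStream<Row> resultStream = tableEnv.toDataStream(resultTable);

        // continue in the DataStream API and trigger execution there
        resultStream.print();
        env.execute();
    }
}
```

Since `fromDataStream` interprets the stream as insert-only and `toDataStream` emits an insert-only stream, both directions stay within the append-only semantics the paragraph above describes.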
Review comment:

   Yes, it will fail. I added an example exception and more explanation.
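For context on this exchange: a hedged sketch, assuming Flink 1.13 semantics, of the failure case being discussed. A grouped aggregation produces an updating table, which `toDataStream` rejects, while `toChangelogStream` is the conversion that accepts updates; the aggregation and names below are illustrative:

```java
import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public final class UpdatingTableExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        DataStream<String> words = env.fromElements("Alice", "Bob", "Alice");

        // the aggregation emits a count per name and later *updates* it,
        // so the result is an updating table, not an insert-only one
        Table counts = tableEnv
                .fromDataStream(words)
                .groupBy($("f0"))
                .select($("f0").as("name"), $("f0").count().as("cnt"));

        // toDataStream() supports only insert-only tables; for this updating
        // table it fails with an exception (assumption: the example exception
        // mentioned in the reply above illustrates exactly this case)
        // DataStream<Row> fails = tableEnv.toDataStream(counts);

        // toChangelogStream() instead emits Rows whose RowKind flags
        // encode inserts, updates, and deletes
        DataStream<Row> changelog = tableEnv.toChangelogStream(counts);
        changelog.print();
        env.execute();
    }
}
```

Splitting the conversion into `toDataStream` and `toChangelogStream` keeps the common append-only path simple while making updating semantics explicit at the call site.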
