WencongLiu commented on code in PR #23362: URL: https://github.com/apache/flink/pull/23362#discussion_r1369707442
##########
docs/content/docs/dev/datastream/how_to_migrate_from_dataset_to_datastream.md:
##########
@@ -0,0 +1,660 @@
+---
+title: "How To Migrate From DataSet to DataStream"
+weight: 302
+type: docs
+bookToc: false
+aliases:
+  - /dev/how_to_migrate_from_dataset_to_datastream.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# How To Migrate From DataSet to DataStream
+
+The DataSet API has been formally deprecated and will no longer receive active maintenance and support. It will be removed in the
+Flink 2.0 version. Flink users are recommended to migrate from the DataSet API to the DataStream API, Table API and SQL for their
+data processing requirements. DataSet operators can be implemented by the DataStream API. However, it's important to note that
+different operators have varying costs in the implementation, and they can be categorized into three types:
+
+1. The first type of operators are quite similar to DataStream in terms of API usage. They can be easily implemented without much
+complexity.
+2. The second type of operators, on the other hand, have completely different names and API usage in DataStream. This can make the
+job code more complex.
+3. Lastly, the third type of operators not only have different names and API usage in DataStream, but they also involve additional
+computation and shuffle costs.
+
+The subsequent sections will first introduce how to set the execution environment and provide detailed explanations on how to implement
+each type of DataSet operators using the DataStream API, highlighting the specific considerations and challenges associated with each type.
+
+
+## Setting the execution environment
+
+To execute a DataSet pipeline by DataStream API, we should first start by moving from ExecutionEnvironment to StreamExecutionEnvironment.
+{{< tabs executionenv >}}
+{{< tab "DataSet">}}
+```java
+ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
+```
+{{< /tab >}}
+{{< tab "DataStream">}}
+```java
+StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
+```
+{{< /tab >}}
+{{< /tabs>}}
+
+As the source of DataSet is always bounded, the execution mode is suggested to be set to RuntimeMode.BATCH to allow Flink to apply
+additional optimizations for batch processing.
+```java
+StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
+executionEnvironment.setRuntimeMode(RuntimeExecutionMode.BATCH);
+```
+
+## Implement the DataSet API by DataStream
+
+### Same API Usage
+
+In the first type of operators, the usage of the API in DataStream is almost identical to that in DataSet. This means that
+implementing these operators using the DataStream API is relatively straightforward and does not require significant modifications
+or complexity in the code.
+
+#### Map
+
+{{< tabs mapfunc >}}
+{{< tab "DataSet">}}
+```java
+dataSet.map(new MapFunction(){
+    // implement user-defined map logic
+});
+```
+{{< /tab >}}
+{{< tab "DataStream">}}
+```java
+dataStream.map(new MapFunction(){
+    // implement user-defined map logic
+});
+```
+{{< /tab >}}
+{{< /tabs>}}
+
+
+#### FlatMap
+
+{{< tabs flatmapfunc >}}
+{{< tab "DataSet">}}
+```java
+dataSet.flatMap(new FlatMapFunction(){
+    // implement user-defined flatmap logic
+});
+```
+{{< /tab >}}
+{{< tab "DataStream">}}
+```java
+dataStream.flatMap(new FlatMapFunction(){
+    // implement user-defined flatmap logic
+});
+```
+{{< /tab >}}
+{{< /tabs>}}
+
+#### Filter
+
+{{< tabs filterfunc >}}
+{{< tab "DataSet">}}
+```java
+dataSet.filter(new FilterFunction(){
+    // implement user-defined filter logic
+});
+```
+{{< /tab >}}
+{{< tab "DataStream">}}
+```java
+dataStream.filter(new FilterFunction(){
+    // implement user-defined filter logic
+});
+```
+{{< /tab >}}
+{{< /tabs>}}
+
+#### Union
+
+{{< tabs unionfunc >}}
+{{< tab "DataSet">}}
+```java
+DataSet<String> input1 = // [...]
+DataSet<String> input2 = // [...]

Review Comment:
   I have modified all sample codes to use the "dataSet/dataStream" variables.

+DataSet<String> output = input1.union(input2);
+```
+{{< /tab >}}
+{{< tab "DataStream">}}
+```java
+DataStream<String> input1 = // [...]
+DataStream<String> input2 = // [...]
+DataStream<String> output = input1.union(input2);
+```
+{{< /tab >}}
+{{< /tabs>}}
+
+
+#### Rebalance
+
+{{< tabs rebalancefunc >}}
+{{< tab "DataSet">}}
+```java
+DataSet<String> input = // [...]
+DataSet<String> output = input.rebalance();
+```
+{{< /tab >}}
+{{< tab "DataStream">}}
+```java
+DataStream<String> input = // [...]
+DataStream<String> output = input.rebalance();
+```
+{{< /tab >}}
+{{< /tabs>}}
+
+#### Reduce on Grouped DataSet
+
+{{< tabs reducegroupfunc >}}
+{{< tab "DataSet">}}
+```java
+DataSet<Tuple2<String, Integer>> input = // [...]
+DataSet<Tuple2<String, Integer>> output = input
+        .groupBy(value -> value.f0)
+        .reduce(new ReduceFunction(){
+            // implement user-defined reduce logic
+        });
+```
+{{< /tab >}}
+{{< tab "DataStream">}}
+```java
+DataStream<Tuple2<String, Integer>> input = // [...]
+DataStream<Tuple2<String, Integer>> output = input
+        .keyBy(value -> value.f0)
+        .reduce(new ReduceFunction(){
+            // implement user-defined reduce logic
+        });
+```
+{{< /tab >}}
+{{< /tabs>}}
+
+#### Aggregate on Grouped DataSet
+
+{{< tabs aggregategroupfunc >}}
+{{< tab "DataSet">}}
+```java
+DataSet<Tuple2<String, Integer>> input = // [...]
+DataSet<Tuple2<String, Integer>> output = input
+        .groupBy(value -> value.f0)
+        // compute sum of the second field
+        // .aggregate(SUM, 1);
+        // compute min of the second field
+        // .aggregate(MIN, 1);
+        // compute max of the second field
+        // .aggregate(MAX, 1);

Review Comment:
   I've separated each aggregate API into a single code block.
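
For readers following the aggregate discussion, here is a minimal plain-Java sketch (no Flink dependency) of what the three grouped aggregates over the second tuple field compute per key. It is not the code from the PR: the class `GroupedAggregateSketch` and the helper `aggregateByKey` are hypothetical names used only for illustration, and `Map.Entry<String, Integer>` stands in for `Tuple2<String, Integer>`.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;

public class GroupedAggregateSketch {

    // Hypothetical helper (not a Flink API): groups entries by key and folds
    // the values of each group with the given operator. Passing Integer::sum,
    // Integer::min or Integer::max mirrors a grouped SUM, MIN or MAX over the
    // second field.
    static Map<String, Integer> aggregateByKey(
            List<Map.Entry<String, Integer>> input, BinaryOperator<Integer> op) {
        return input.stream()
                .collect(Collectors.toMap(
                        Map.Entry::getKey,   // grouping key (field f0)
                        Map.Entry::getValue, // aggregated value (field f1)
                        op,                  // how two values of the same key are merged
                        TreeMap::new));      // sorted map, for a stable print order
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = List.of(
                Map.entry("a", 1), Map.entry("b", 5), Map.entry("a", 3));

        System.out.println(aggregateByKey(input, Integer::sum)); // prints {a=4, b=5}
        System.out.println(aggregateByKey(input, Integer::min)); // prints {a=1, b=5}
        System.out.println(aggregateByKey(input, Integer::max)); // prints {a=3, b=5}
    }
}
```

Keeping one operator per call matches the idea of showing each aggregate in its own code block: each variant can be read and run independently.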
