[GitHub] [flink-web] zentol commented on a diff in pull request #631: Add blog article: Howto migrate a real-life batch pipeline from the DataSet API to the DataStream API

via GitHub Wed, 19 Apr 2023 04:17:29 -0700


zentol commented on code in PR #631:
URL: https://github.com/apache/flink-web/pull/631#discussion_r1171189504



##########
docs/content/posts/2023-04-12-howto-migrate-to-datastream.md:
##########
@@ -0,0 +1,190 @@
+---
+title:  "Howto migrate a real-life batch pipeline from the DataSet API to the DataStream API"
+date: "2023-04-12T08:00:00.000Z"
+authors:
+
+- echauchot:
+  name: "Etienne Chauchot"
+  twitter: "echauchot"
+  aliases:
+- /2023/04/12/2023-04-12-howto-migrate-to-datastream.html
+
+---
+
+## Introduction
+
+The Flink community has been deprecating the DataSet API since version 1.12 as part of the work on
+[FLIP-131: Consolidate the user-facing Dataflow SDKs/APIs (and deprecate the DataSet API)](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741)
+.
+This blog article illustrates the migration of a real-life batch DataSet pipeline to a batch
+DataStream pipeline.
+All the code presented in this article is available in
+the [tpcds-benchmark-flink repo](https://github.com/echauchot/tpcds-benchmark-flink).
+The use case shown here is extracted from a broader work comparing Flink performances of different
+APIs
+by implementing [TPCDS](https://www.tpc.org/tpcds/) queries using these APIs.
+
+## What is TPCDS?
+
+TPC-DS is a decision support benchmark that models several generally applicable aspects of a
+decision support system. The purpose of TPCDS benchmarks is to provide relevant, objective
+performance data of Big Data engines to industry users.
+
+## Chosen TPCDS query
+
+The chosen query for this article is **Query3**  because it contains all the more common analytics
+operators (filter, join, aggregation, group by, order by, limit). It represents an analytic query on
+store sales. Its SQL code is presented here:
+
+`SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand,SUM(ss_ext_sales_price) sum_agg
+FROM  date_dim dt, store_sales, item
+WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
+AND store_sales.ss_item_sk = item.i_item_sk
+AND item.i_manufact_id = 128
+AND dt.d_moy=11
+GROUP BY dt.d_year, item.i_brand, item.i_brand_id
+ORDER BY dt.d_year, sum_agg desc, brand_id
+LIMIT 100`
+
+## The initial DataSet pipeline
+
+The pipeline we are migrating
+is [this one](https://github.com/echauchot/tpcds-benchmark-flink/blob/f342c1983ec340e52608eb1835e85c82c8ece1d2/src/main/java/org/example/tpcds/flink/Query3ViaFlinkRowDataset.java)
+, it is a batch pipeline that implements the above query using the DataSet API

Review Comment:
   ```suggestion
   is [this](https://github.com/echauchot/tpcds-benchmark-flink/blob/f342c1983ec340e52608eb1835e85c82c8ece1d2/src/main/java/org/example/tpcds/flink/Query3ViaFlinkRowDataset.java)
   batch pipeline that implements the above query using the DataSet API
   ```



##########
docs/content/posts/2023-04-12-howto-migrate-to-datastream.md:
##########
@@ -2,29 +2,39 @@
 title:  "Howto migrate a real-life batch pipeline from the DataSet API to the DataStream API"
 date: "2023-04-12T08:00:00.000Z"
 authors:
+
 - echauchot:
   name: "Etienne Chauchot"
   twitter: "echauchot"
-aliases:
-- /news/2023/04/12/2023-04-12-howto-migrate-to-datastream.html
+  aliases:
+- /2023/04/12/2023-04-12-howto-migrate-to-datastream.html

Review Comment:
   We shouldn't need an alias; this is only required for older blog posts.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]