yanghua commented on a change in pull request #2181:
URL: https://github.com/apache/hudi/pull/2181#discussion_r508370772
##########
File path: docs/_posts/2020-10-15-apache-hudi-meets-apache-flink.md
##########
@@ -0,0 +1,196 @@
+---
+title: "Apache Hudi meets Apache Flink"
+excerpt: "The design and latest progress of the integration of Apache hudi and
Apache Flink."
+author: wangxianghu
+category: blog
+---
+
+Apache Hudi is a data lake framework developed and open sourced by Uber. Hudi
joined the Apache incubator for incubation in January 2019, and was promoted to
the top Apache project in May 2020. It is one of the most popular data lake
frameworks.
+
+## 1. Why decouple
+
+Hudi has been using Spark as its data processing engine since its birth. If
users want to use Hudi as their data lake framework, they must introduce Spark
into their platform technology stack.
+A few years ago, using Spark as a big data processing engine can be said to be
very common or even natural. Since Spark can either perform batch processing or
use micro-batch to simulate streaming, one engine solves both streaming and
batch problems.
+However, in recent years, with the development of big data technology, Flink,
which is also a big data processing engine, has gradually entered people's
vision and has occupied a certain market in the field of computing engines.
+In the big data technology community, forums and other territories, the voice
of whether Hudi supports flink has gradually appeared and has become more
frequent. Therefore, it is a valuable thing to make hudi support the flink
engine, and the first step of integrating the Flink engine is that hudi and
spark are decoupled.
+
+In addition, looking at the mature, active, and viable frameworks in the big
data, all frameworks are elegant in design and can be integrated with other
frameworks and leverage each other's expertise.
+Therefore, decoupling Hudi from Spark and turning it into an
engine-independent data lake framework will undoubtedly create more
possibilities for the integration of Hudi and other components, allowing Hudi
to better integrate into the big data ecosystem.
+
+## 2. Challenges
+
+Hudi's internal use of Spark API is as common as our usual development and use
of List. Since the data source reads the data, and finally writes the data to
the table, Spark RDD is used as the main data structure everywhere, and even
ordinary tools are implemented using the Spark API.
+It can be said that Hudi is a universal data lake framework implemented by
Spark. Hudi also leverage deep Spark functionality like custom partitioning,
in-memory caching to implement indexing and file sizing using workload
heuristics.
Review comment:
leverage -> leverages
##########
File path: docs/_posts/2020-10-15-apache-hudi-meets-apache-flink.md
##########
+In the big data technology community, forums and other territories, the voice
of whether Hudi supports flink has gradually appeared and has become more
frequent. Therefore, it is a valuable thing to make hudi support the flink
engine, and the first step of integrating the Flink engine is that hudi and
spark are decoupled.
Review comment:
make sure all "flink" -> "Flink", "spark" -> "Spark", "hudi" -> "Hudi"
##########
File path: docs/_posts/2020-10-15-apache-hudi-meets-apache-flink.md
##########
+For some of these, Flink offers better out-of-box support (e.g using Flink’s
state store for indexing) and can in fact, make Hudi approach real-time
latencies more and more.
+
+In addition, the primary engine integrated after this decoupling is Flink.
Flink and Spark differ greatly in core abstraction. Spark believes that data is
bounded, and its core abstraction is a limited set of data.
+Flink believes that the essence of data is stream, and its core abstract
DataStream contains various operations on data. Hudi, has a streaming first
design (record level updates, record level streams), that arguably fit the
Flink model more naturally.
Review comment:
`stream ` -> `a stream`
##########
File path: docs/_posts/2020-10-15-apache-hudi-meets-apache-flink.md
##########
+In addition, the primary engine integrated after this decoupling is Flink.
Flink and Spark differ greatly in core abstraction. Spark believes that data is
bounded, and its core abstraction is a limited set of data.
+Flink believes that the essence of data is stream, and its core abstract
DataStream contains various operations on data. Hudi, has a streaming first
design (record level updates, record level streams), that arguably fit the
Flink model more naturally.
Review comment:
`Hudi,` -> `Hudi`
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]