[GitHub] [hudi] yanghua commented on a change in pull request #1992: [BLOG] Incremental processing on data lakes by vinoyang

GitBox Thu, 20 Aug 2020 02:33:19 -0700


yanghua commented on a change in pull request #1992:
URL: https://github.com/apache/hudi/pull/1992#discussion_r473611869




##########
File path: docs/_posts/2020-08-18-hudi-incremental-processing-on-data-lakes.md
##########
@@ -0,0 +1,275 @@
+---
+title: "Incremental Processing on the Data Lake"
+excerpt: "How Apache Hudi provides ability for incremental data processing."
+author: vinoyang
+category: blog
+---
+
+### NOTE: This article is a translation of the infoq.cn article, found 
[here](https://www.infoq.cn/article/CAgIDpfJBVcJHKJLSbhe), with minor edits
+
+Apache Hudi is a data lake framework which provides the ability to ingest, 
manage and query large analytical data sets on a distributed file system/cloud 
stores. 
+Hudi joined the Apache incubator for incubation in January 2019, and was 
promoted to the top Apache project in May 2020. This article mainly discusses 
the importance 
+of Hudi to the data lake from the perspective of "incremental processing". 
More information about Apache Hudi's framework functions, features, usage 
scenarios, and 
+latest developments can be found at QCon Global Software Development 
Conference (Beijing Station) 2020.

Review comment:
       `(Beijing Station)` -> `(Shanghai Station)`

##########
File path: docs/_posts/2020-08-18-hudi-incremental-processing-on-data-lakes.md
##########
@@ -0,0 +1,275 @@
+---
+title: "Incremental Processing on the Data Lake"
+excerpt: "How Apache Hudi provides ability for incremental data processing."
+author: vinoyang
+category: blog
+---
+
+### NOTE: This article is a translation of the infoq.cn article, found 
[here](https://www.infoq.cn/article/CAgIDpfJBVcJHKJLSbhe), with minor edits
+
+Apache Hudi is a data lake framework which provides the ability to ingest, 
manage and query large analytical data sets on a distributed file system/cloud 
stores. 
+Hudi joined the Apache incubator for incubation in January 2019, and was 
promoted to the top Apache project in May 2020. This article mainly discusses 
the importance 
+of Hudi to the data lake from the perspective of "incremental processing". 
More information about Apache Hudi's framework functions, features, usage 
scenarios, and 
+latest developments can be found at QCon Global Software Development 
Conference (Beijing Station) 2020.
+
+Throughout the development of big data technology, Hadoop has steadily seized 
the opportunities of this era and has become the de-facto standard for 
enterprises to build big data infrastructure. 
+Among them, the distributed file system HDFS that supports the Hadoop 
ecosystem almost naturally has become the standard interface for big data 
storage systems. In recent years, with the rise of 
+cloud-native architectures, we have seen a wave of newer models embracing 
low-cost cloud storage emerging, a number of data lake frameworks compatible 
with HDFS interfaces 
+embracing cloud vendor storage have emerged in the industry as well. 
+
+However, we are still processing data pretty much in the same way we did 10 
years ago. This article will try to talk about its importance to the data lake 
from the perspective of "incremental processing".
+
+## Traditional data lakes lack the primitives for incremental processing
+
+In the era of mobile Internet and Internet of Things, delayed arrival of data 
is very common. 
+Here we are involved in the definition of two time semantics: [event time and 
processing 
time](https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/). 
+
+As the name suggests:
+
+ - **Event time:** the time when the event actually occurred;
+ - **Processing time:** the time when an event is observed (processed) in the 
system;
+
+Ideally, the event time and the processing time are the same, but in reality, 
they may have more or less deviation, which we often call "Time Skew". 
+Whether for low-latency stream computing or common batch processing, the 
processing of event time and processing time and late data is a common and 
difficult problem. 
+In general, in order to ensure correctness, when we strictly follow the "event 
time" semantics, late data will trigger the 
+[recalculation of the time 
window](https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/windows.html#late-elements-considerations)
 
+(usually Hive partitions for batch processing), although the results of these 
"windows" may have been calculated or even interacted with the end user. 
+For recalculation, the extensible key-value storage structure is usually used 
in streaming processing, which is processed incrementally at the record/event 
level and optimized 
+based on point queries and updates. However, in data lakes, recalculating 
usually means rewriting the entire (immutable) Hive partition (or simply a 
folder in DFS), and 
+re-triggering the recalculation of cascading tasks that have consumed that 
Hive partition.
+
+With data lakes supporting massive amounts of data, many long-tail businesses 
still have a strong demand for updating cold data. However, for a long time, 
+the data in a single partition in the data lake was designed to be 
non-updatable. If it needs to be updated, the entire partition needs to be 
rewritten. 
+This will seriously damage the efficiency of the entire ecosystem. From the 
perspective of latency and resource utilization, these operations on Hadoop 
will incur expensive overhead.
+Besides, this overhead is usually also cascaded to the entire Hadoop data 
processing pipeline, which ultimately leads to an increase in latency by 
several hours.
+
+In response to the two problems mentioned above, if the data lake supports 
fine-grained incremental processing, we can incorporate changes into existing 
Hive partitions 
+more effectively, and provide a way for downstream table consumers to obtain 
only the changed data. For effectively supporting incremental processing, we 
can decompose it into the 
+following two primitive operations:
+
+ - **Update insert (upsert):** Conceptually, rewriting the entire partition 
can be regarded as a very inefficient upsert operation, which will eventually 
write much more data than the 
+original data itself. Therefore, support for (bulk) upsert is considered a 
very important feature. [Google's Mesa](https://research.google/pubs/pub42851/) 
(Google's data warehouse system) also 
+talks about several techniques that can be applied to rapid data ingestion 
scenarios.
+
+ - **Incremental consumption:** Although upsert can solve the problem of 
quickly releasing new data to a partition, downstream data consumers do not 
know 
+ which data has been changed from which time in the past. Usually, consumers 
can only know the changed data by scanning the entire partition/data table and 
+ recalculating all the data, which requires considerable time and resources. 
Therefore, we also need a mechanism to more efficiently obtain data records 
that 
+ have changed since the last time the partition was consumed.
+
+With the above two primitive operations, you can upsert a data set, and then 
incrementally consume from it, and create another (also incremental) data set 
to solve the two problems 
+we mentioned above and support many common cases, so as to support end-to-end 
incremental processing and reduce end-to-end latency. These two primitives 
combine with each other, 
+unlocking the ability of stream/incremental processing based on DFS 
abstraction.
+
+The storage scale of the data lake far exceeds that of the data warehouse. 
Although the two have different focuses on the definition of functions, 
+there is still a considerable intersection (of course, there are still 
disputes and deviations from definition and implementation. 
+This is not the topic this article tries to discuss). In any case, the data 
lake will support larger analytical data sets with cheaper storage, 
+so incremental processing is also very important for it. Next let's discuss 
the significance of incremental processing for the data lake.
+
+## The significance of incremental processing for the data lake
+
+### Streaming Semantics
+
+It has long been stated that there is a 
"[dualism](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)"
 
+between the change log (that is, the "flow" in the conventional sense we 
understand) and the table.
+
+![dualism](/assets/images/blog/incr-processing/image4.jpg)
+
+The core of this discussion is: if there is a change log, you can use these 
changes to generate a data table and get the current status. If you update a 
table, 
+you can record these changes and publish all "change logs" to the table's 
status information. This interchangeable nature is called "stream table 
duality" for short.
+
+A more general understanding of "stream table duality": when the business 
system is modifying the data in the MySQL table, MySQL will reflect these 
changes as Binlog, 
+if we publish these continuous Binlog (stream) to Kafka, and then let the 
downstream processing system subscribe to the Kafka, and use the state store to 
gradually 
+accumulate the intermediate results. Then the current state of this 
intermediate result can reflects the current snapshot of the table.
+
+If the two primitives mentioned above that support incremental processing can 
be introduced to the data lake, the above pipeline, which can reflect the 
+"duality of flow table", is also applicable on the data lake. Based on the 
first primitive, the data lake can also ingest the Binlog log streams in Kafka, 

Review comment:
       `"duality of flow table"` -> `"stream table duality"`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] yanghua commented on a change in pull request #1992: [BLOG] Incremental processing on data lakes by vinoyang

Reply via email to