xushiyan commented on a change in pull request #3322: URL: https://github.com/apache/hudi/pull/3322#discussion_r676838941
########## File path: docs/_posts/2021-07-21-streaming-data-lake-platform.md ##########
@@ -0,0 +1,150 @@

---
title: "Apache Hudi - The Streaming Data Lake Platform"
excerpt: "It's been called many things. But, we have always been building a data lake platform"
author: vinoth
category: blog
---

As early as 2016, we set out a [bold, new vision](https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/) reimagining batch data processing through a new “**incremental**” data processing stack - alongside the existing batch and streaming stacks. While a stream processing pipeline does row-oriented processing, delivering a few seconds of processing latency, an incremental pipeline would apply the same principles to *columnar* data in the data lake, delivering orders-of-magnitude improvements in processing efficiency within a few minutes, on extremely scalable batch storage/compute infrastructure. This new stack would also be able to effortlessly support regular batch processing for bulk reprocessing/backfilling. Hudi was built as the manifestation of this vision, rooted in real, hard problems faced at [Uber](https://eng.uber.com/uber-big-data-platform/), and later took on a life of its own in the open source community. Together, we have already been able to usher in fully incremental data ingestion and moderately complex ETLs on data lakes.

Today, this grand vision of being able to express almost any batch pipeline incrementally is more attainable than ever. Stream processing is [maturing rapidly](https://flink.apache.org/blog/) and gaining [tremendous momentum](https://www.confluent.io/blog/every-company-is-becoming-software/), with the [generalization](https://flink.apache.org/2021/03/11/batch-execution-mode.html) of stream processing APIs to work over a batch execution model.
Hudi completes the missing pieces of the puzzle by providing streaming-optimized lake storage, much like how Kafka/Pulsar enable efficient storage for event streaming. [Many organizations](https://hudi.apache.org/docs/powered_by.html) have already reaped real benefits from adopting a streaming model for their data lakes, in terms of fresh data, simplified architecture and great cost reductions.

But first, we needed to tackle the basics - transactions and mutability - on the data lake. In many ways, Apache Hudi pioneered the transactional data lake movement as we know it today. Specifically, during a time when more special-purpose systems were being born, Hudi introduced a server-less transaction layer, which worked over the general-purpose Hadoop FileSystem abstraction on Cloud Stores/HDFS. This model helped Hudi scale writers/readers to 1000s of cores on day one, compared to warehouses, which offer a richer set of transactional guarantees but are often bottlenecked by the 10s of servers that need to handle them. It also brought us a lot of joy to see similar systems (e.g., Delta Lake) later adopt the same server-less transaction layer model that we originally shared way back in [early '17](https://eng.uber.com/hoodie/). We consciously introduced two table types - Copy On Write (with simpler operability) and Merge On Read (for greater flexibility) - and these terms are now used in [projects](https://github.com/apache/iceberg/pull/1862) outside Hudi, to refer to similar ideas being borrowed from Hudi. Through open sourcing and [graduating](https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces64) from the Apache Incubator, we have made some great progress elevating these ideas [across the industry](http://hudi.apache.org/docs/powered_by.html), as well as bringing them to life with a cohesive software stack.
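The two table types differ mainly in *when* updates get merged. As a rough conceptual sketch (plain Python for illustration only, not Hudi code): Copy On Write merges updates into a rewritten base file at write time, while Merge On Read appends updates to a delta log and merges them at read time.

```python
# Toy model: a "file" is a dict of record-key -> value.

def cow_update(base_file, updates):
    """Copy On Write: rewrite the base file with updates merged in.
    Writes are heavier, but reads see plain columnar files."""
    merged = dict(base_file)
    merged.update(updates)
    return merged  # a brand-new base file

def mor_update(log, updates):
    """Merge On Read: just append updates to a delta log. Writes are cheap."""
    log.append(dict(updates))
    return log

def mor_read(base_file, log):
    """Merge On Read pays the merge cost at query time instead."""
    snapshot = dict(base_file)
    for delta in log:
        snapshot.update(delta)
    return snapshot

base = {"id1": "v1", "id2": "v1"}
print(cow_update(base, {"id2": "v2"}))                # merged base file
print(mor_read(base, mor_update([], {"id2": "v2"})))  # same merged view, cheaper write
```

Both paths yield the same logical snapshot; the trade-off is write amplification (Copy On Write) versus read-time merge cost plus background compaction (Merge On Read).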
Given the exciting developments in the past year or so that have propelled data lakes further into the mainstream, we thought some perspective could help users see Hudi with the right lens, appreciate what it stands for, and be a part of where it’s headed. At this time, we also wanted to shine some light on all the great work done by [180+ contributors](https://github.com/apache/hudi/graphs/contributors) on the project, working with more than 2000 unique users over slack/github/jira, contributing all the different capabilities Hudi has gained over the past years, from its humble beginnings.

This is going to be a rather long post, but we will do our best to make it worth your time. Let’s roll.

## Data Lake Platform

We have noticed that Hudi is sometimes positioned as a “[table format](https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc)” or “transactional layer”. While this is not incorrect, it does not do full justice to all that Hudi has to offer.

### Is Hudi a “format”?

Hudi was not designed as a general-purpose table format, tracking files/folders for batch processing. Rather, the functionality provided by a table format is merely one layer in the Hudi software stack. Hudi was designed to play well with the Hive format (if you will), given how popular and widespread it is. Over time, to solve scaling challenges or bring in additional functionality, we have invested in our own native table format with an eye toward the incremental processing vision - e.g., we need to support shorter transactions that commit every few seconds. We believe these requirements would fully subsume the challenges solved by general-purpose table formats over time. But we are also open to plugging in or syncing to other open table formats, so their users can also benefit from the rest of the Hudi stack.
Unlike the file formats, a table format is merely a representation of table metadata, and it’s actually quite possible to translate from Hudi to other formats and vice versa, if users are willing to accept the trade-offs.

### Is Hudi a transactional layer?

Of course, Hudi had to provide transactions for implementing deletes/updates, but Hudi’s transactional layer is designed around an [event log](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying) that is also well-integrated with an entire set of built-in table/data services. For example, compaction is aware of clustering actions already scheduled and optimizes by skipping over the files being clustered - while the user is blissfully unaware of all this. Hudi also provides out-of-the-box tools for ingesting and ETLing data, and much more. We have always been thinking of Hudi as solving a database problem around stream processing - areas that are actually [very related to each other](https://www.infoq.com/presentations/streaming-databases/). In fact, stream processing is enabled by logs (capture/emit event streams, rewind/reprocess) and databases (state stores, updatable sinks). With Hudi, the idea was that if we build a database supporting efficient updates and extracting data streams, while remaining optimized for large batch queries, then incremental pipelines can be built using Hudi tables as state stores & updatable sinks.

Thus, the best way to describe Apache Hudi is as a **Streaming Data Lake Platform** built around a *database kernel*. The words carry significant meaning.

**Streaming**: At its core, by optimizing for fast upserts & change streams, Hudi provides primitives for data lake workloads that are comparable to what [Apache Kafka](https://kafka.apache.org/) does for event streaming (namely, incremental produce/consume of events and a state store for interactive querying).
**Data Lake**: Nonetheless, Hudi provides an optimized, self-managing data plane for large-scale data processing on the lake (ad-hoc queries, ML pipelines, batch pipelines), powering arguably the [largest transactional lake](https://eng.uber.com/apache-hudi-graduation/) in the world. While Hudi can be used to build a [lakehouse](https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html), given its transactional capabilities, Hudi goes beyond and unlocks an end-to-end streaming architecture. In contrast, the word “streaming” appears just 3 times in the lakehouse [paper](http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf), and one of those mentions is talking about Hudi.

**Platform**: Oftentimes in open source, there is great tech, but there are just too many options - all differing ever so slightly in their opinionated ways, ultimately making the integration task onerous on the end user. Lake users deserve the same great usability that cloud warehouses provide, with the additional freedom and transparency of a true open source community. Hudi’s data and table services, tightly integrated with the Hudi “kernel”, give us the ability to deliver cross-layer optimizations with reliability and ease of use.

## Hudi Stack

The following stack captures the layers of software components that make up Hudi, with each layer depending on and drawing strength from the layer below. Typically, data lake users write data out once using an open file format like Apache [Parquet](http://parquet.apache.org/)/[ORC](https://orc.apache.org/), stored on top of extremely scalable cloud storage or distributed file systems. Hudi provides a self-managing data plane to ingest, transform and manage this data, in a way that unlocks incremental data processing on it.

Furthermore, Hudi either already provides or plans to add components that make this data universally accessible to all the different query engines out there.
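To make the idea of incremental data processing concrete: writers commit changes to an ordered timeline, and downstream pipelines pull only the records committed after the instant they last consumed, instead of rescanning the whole table. A toy sketch in plain Python (`ToyTimeline` is a hypothetical illustration, not Hudi's actual timeline API):

```python
class ToyTimeline:
    """A minimal commit timeline: each commit gets a monotonically
    increasing 'instant', and readers can resume from any instant."""

    def __init__(self):
        self.commits = []  # list of (instant, records) in commit order

    def commit(self, records):
        instant = len(self.commits) + 1  # toy stand-in for an instant time
        self.commits.append((instant, list(records)))
        return instant

    def incremental_read(self, since_instant=0):
        """Return only records committed strictly after `since_instant`."""
        return [r for instant, recs in self.commits
                if instant > since_instant
                for r in recs]

table = ToyTimeline()
seen = table.commit(["a", "b"])
table.commit(["c"])
# A downstream incremental pipeline resumes from the last instant it processed:
print(table.incremental_read(since_instant=seen))  # only the new records
```

This is the sense in which a table doubles as a change stream: the same storage serves full snapshot queries (`since_instant=0`) and cheap incremental pulls for downstream pipelines.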
The features annotated with `*` represent work in progress, and dotted boxes represent planned future work, to complete our vision for the project. While we have strawman designs outlined for the newer components in the blog, we welcome fresh perspectives from the community with open arms. The rest of the blog will delve into each layer in our stack - explaining what it does, how it's designed for incremental processing, and how it will evolve in the future.

## Lake Storage

Hudi interacts with lake storage using the [Hadoop FileSystem API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), which makes it compatible with all of its implementations, ranging from HDFS to Cloud Stores to even in-memory filesystems like [Alluxio](https://www.alluxio.io/blog/building-high-performance-data-lake-using-apache-hudi-and-alluxio-at-t3go/)/Ignite. Hudi internally implements its own [wrapper filesystem](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java) on top, to provide additional storage optimizations (e.g., file sizing), performance optimizations (e.g., buffering), and metrics. Uniquely, Hudi takes full advantage of append support, for storage schemes that support it, like HDFS. This helps Hudi deliver streaming writes without causing an explosion in file counts/table metadata. Unfortunately, most cloud/object stores do not offer append capability today (except maybe [Azure](https://azure.microsoft.com/en-us/updates/append-blob-support-in-azure-data-lake-storage-is-now-generally-available/)). In the future, we plan to leverage the lower-level APIs of major cloud object stores, to provide similar controls over file counts at streaming ingest latencies.

Review comment: hopefully `HoodieWrapperFileSystem` won't be moved to another package. Would it be better to use a permalink here for "wrapper filesystem"?
