xushiyan commented on a change in pull request #3322: URL: https://github.com/apache/hudi/pull/3322#discussion_r676838941
########## File path: docs/_posts/2021-07-21-streaming-data-lake-platform.md ##########
@@ -0,0 +1,150 @@

---
title: "Apache Hudi - The Streaming Data Lake Platform"
excerpt: "It's been called many things. But, we have always been building a data lake platform"
author: vinoth
category: blog
---

As early as 2016, we set out a [bold, new vision](https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/) reimagining batch data processing through a new “**incremental**” data processing stack - alongside the existing batch and streaming stacks. While a stream processing pipeline does row-oriented processing, delivering a few seconds of processing latency, an incremental pipeline would apply the same principles to *columnar* data in the data lake, delivering orders-of-magnitude improvements in processing efficiency within a few minutes, on extremely scalable batch storage/compute infrastructure. This new stack would also be able to effortlessly support regular batch processing for bulk reprocessing/backfilling. Hudi was built as the manifestation of this vision, rooted in real, hard problems faced at [Uber](https://eng.uber.com/uber-big-data-platform/), and later took on a life of its own in the open source community. Together, we have already been able to usher in fully incremental data ingestion and moderately complex ETLs on data lakes.

Today, this grand vision of being able to express almost any batch pipeline incrementally is more attainable than ever. Stream processing is [maturing rapidly](https://flink.apache.org/blog/) and gaining [tremendous momentum](https://www.confluent.io/blog/every-company-is-becoming-software/), with the [generalization](https://flink.apache.org/2021/03/11/batch-execution-mode.html) of stream processing APIs to work over a batch execution model.
Hudi completes the missing pieces of the puzzle by providing streaming-optimized lake storage, much like how Kafka/Pulsar enable efficient storage for event streaming. [Many organizations](https://hudi.apache.org/docs/powered_by.html) have already reaped real benefits from adopting a streaming model for their data lakes, in terms of fresh data, simplified architecture and great cost reductions.

But first, we needed to tackle the basics - transactions and mutability - on the data lake. In many ways, Apache Hudi pioneered the transactional data lake movement as we know it today. Specifically, during a time when more special-purpose systems were being born, Hudi introduced a server-less transaction layer, which worked over the general-purpose Hadoop FileSystem abstraction on Cloud Stores/HDFS. This model helped Hudi scale writers/readers to 1000s of cores on day one, compared to warehouses, which offer a richer set of transactional guarantees but are often bottlenecked by the 10s of servers that need to handle them. It also brought us a lot of joy to see similar systems (e.g., Delta Lake) later adopt the same server-less transaction layer model that we originally shared way back in [early '17](https://eng.uber.com/hoodie/). We consciously introduced two table types - Copy On Write (with simpler operability) and Merge On Read (for greater flexibility) - and these terms are now used in [projects](https://github.com/apache/iceberg/pull/1862) outside Hudi, to refer to similar ideas being borrowed from Hudi. Through open sourcing and [graduating](https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces64) from the Apache Incubator, we have made some great progress elevating these ideas [across the industry](http://hudi.apache.org/docs/powered_by.html), as well as bringing them to life with a cohesive software stack.
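The two table types differ mainly in *when* updates get merged. As a rough conceptual sketch (plain Python for illustration only, not Hudi code): Copy On Write merges updates into a rewritten base file at write time, while Merge On Read appends updates to a delta log and merges them at read time.

```python
# Toy model: a "file" is a dict of record-key -> value.

def cow_update(base_file, updates):
    """Copy On Write: rewrite the base file with updates merged in.
    Writes are heavier, but reads see plain columnar files."""
    merged = dict(base_file)
    merged.update(updates)
    return merged  # a brand-new base file

def mor_update(log, updates):
    """Merge On Read: just append updates to a delta log. Writes are cheap."""
    log.append(dict(updates))
    return log

def mor_read(base_file, log):
    """Merge On Read pays the merge cost at query time instead."""
    snapshot = dict(base_file)
    for delta in log:
        snapshot.update(delta)
    return snapshot

base = {"id1": "v1", "id2": "v1"}
print(cow_update(base, {"id2": "v2"}))                # merged base file
print(mor_read(base, mor_update([], {"id2": "v2"})))  # same merged view, cheaper write
```

Both paths yield the same logical snapshot; the trade-off is write amplification (Copy On Write) versus read-time merge cost plus background compaction (Merge On Read).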
Given the exciting developments in the past year or so that have propelled data lakes further into the mainstream, we thought some perspective could help users see Hudi with the right lens, appreciate what it stands for, and be a part of where it’s headed. At this time, we also wanted to shine some light on all the great work done by [180+ contributors](https://github.com/apache/hudi/graphs/contributors) on the project, working with more than 2000 unique users over slack/github/jira, contributing all the different capabilities Hudi has gained over the past years, from its humble beginnings.

This is going to be a rather long post, but we will do our best to make it worth your time. Let’s roll.

## Data Lake Platform

We have noticed that Hudi is sometimes positioned as a “[table format](https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc)” or “transactional layer”. While this is not incorrect, it does not do full justice to all that Hudi has to offer.

### Is Hudi a “format”?

Hudi was not designed as a general-purpose table format, tracking files/folders for batch processing. Rather, the functionality provided by a table format is merely one layer in the Hudi software stack. Hudi was designed to play well with the Hive format (if you will), given how popular and widespread it is. Over time, to solve scaling challenges or bring in additional functionality, we have invested in our own native table format with an eye toward the incremental processing vision - e.g., we need to support shorter transactions that commit every few seconds. We believe these requirements would fully subsume the challenges solved by general-purpose table formats over time. But we are also open to plugging in or syncing to other open table formats, so their users can also benefit from the rest of the Hudi stack.
Unlike the file formats, a table format is merely a representation of table metadata, and it’s actually quite possible to translate from Hudi to other formats and vice versa, if users are willing to accept the trade-offs.

### Is Hudi a transactional layer?

Of course, Hudi had to provide transactions for implementing deletes/updates, but Hudi’s transactional layer is designed around an [event log](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying) that is also well-integrated with an entire set of built-in table/data services. For example, compaction is aware of clustering actions already scheduled and optimizes by skipping over the files being clustered - while the user is blissfully unaware of all this. Hudi also provides out-of-the-box tools for ingesting and ETLing data, and much more. We have always been thinking of Hudi as solving a database problem around stream processing - areas that are actually [very related to each other](https://www.infoq.com/presentations/streaming-databases/). In fact, stream processing is enabled by logs (capture/emit event streams, rewind/reprocess) and databases (state stores, updatable sinks). With Hudi, the idea was that if we build a database supporting efficient updates and extracting data streams, while remaining optimized for large batch queries, then incremental pipelines can be built using Hudi tables as state stores & updatable sinks.

Thus, the best way to describe Apache Hudi is as a **Streaming Data Lake Platform** built around a *database kernel*. The words carry significant meaning.

**Streaming**: At its core, by optimizing for fast upserts & change streams, Hudi provides primitives for data lake workloads that are comparable to what [Apache Kafka](https://kafka.apache.org/) does for event streaming (namely, incremental produce/consume of events and a state store for interactive querying).
**Data Lake**: Nonetheless, Hudi provides an optimized, self-managing data plane for large-scale data processing on the lake (ad-hoc queries, ML pipelines, batch pipelines), powering arguably the [largest transactional lake](https://eng.uber.com/apache-hudi-graduation/) in the world. While Hudi can be used to build a [lakehouse](https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html), given its transactional capabilities, Hudi goes beyond and unlocks an end-to-end streaming architecture. In contrast, the word “streaming” appears just 3 times in the lakehouse [paper](http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf), and one of those mentions is talking about Hudi.

**Platform**: Oftentimes in open source, there is great tech, but there are just too many options - all differing ever so slightly in their opinionated ways, ultimately making the integration task onerous on the end user. Lake users deserve the same great usability that cloud warehouses provide, with the additional freedom and transparency of a true open source community. Hudi’s data and table services, tightly integrated with the Hudi “kernel”, give us the ability to deliver cross-layer optimizations with reliability and ease of use.

## Hudi Stack

The following stack captures the layers of software components that make up Hudi, with each layer depending on and drawing strength from the layer below. Typically, data lake users write data out once using an open file format like Apache [Parquet](http://parquet.apache.org/)/[ORC](https://orc.apache.org/), stored on top of extremely scalable cloud storage or distributed file systems. Hudi provides a self-managing data plane to ingest, transform and manage this data, in a way that unlocks incremental data processing on it.

Furthermore, Hudi either already provides or plans to add components that make this data universally accessible to all the different query engines out there.
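To make the idea of incremental data processing concrete: writers commit changes to an ordered timeline, and downstream pipelines pull only the records committed after the instant they last consumed, instead of rescanning the whole table. A toy sketch in plain Python (`ToyTimeline` is a hypothetical illustration, not Hudi's actual timeline API):

```python
class ToyTimeline:
    """A minimal commit timeline: each commit gets a monotonically
    increasing 'instant', and readers can resume from any instant."""

    def __init__(self):
        self.commits = []  # list of (instant, records) in commit order

    def commit(self, records):
        instant = len(self.commits) + 1  # toy stand-in for an instant time
        self.commits.append((instant, list(records)))
        return instant

    def incremental_read(self, since_instant=0):
        """Return only records committed strictly after `since_instant`."""
        return [r for instant, recs in self.commits
                if instant > since_instant
                for r in recs]

table = ToyTimeline()
seen = table.commit(["a", "b"])
table.commit(["c"])
# A downstream incremental pipeline resumes from the last instant it processed:
print(table.incremental_read(since_instant=seen))  # only the new records
```

This is the sense in which a table doubles as a change stream: the same storage serves full snapshot queries (`since_instant=0`) and cheap incremental pulls for downstream pipelines.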
The features annotated with `*` represent work in progress, and dotted boxes represent planned future work, to complete our vision for the project. While we have strawman designs outlined for the newer components in the blog, we welcome fresh perspectives from the community with open arms. The rest of the blog will delve into each layer in our stack - explaining what it does, how it's designed for incremental processing, and how it will evolve in the future.

## Lake Storage

Hudi interacts with lake storage using the [Hadoop FileSystem API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), which makes it compatible with all of its implementations, ranging from HDFS to Cloud Stores to even in-memory filesystems like [Alluxio](https://www.alluxio.io/blog/building-high-performance-data-lake-using-apache-hudi-and-alluxio-at-t3go/)/Ignite. Hudi internally implements its own [wrapper filesystem](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java) on top, to provide additional storage optimizations (e.g., file sizing), performance optimizations (e.g., buffering), and metrics. Uniquely, Hudi takes full advantage of append support, for storage schemes that support it, like HDFS. This helps Hudi deliver streaming writes without causing an explosion in file counts/table metadata. Unfortunately, most cloud/object stores do not offer append capability today (except maybe [Azure](https://azure.microsoft.com/en-us/updates/append-blob-support-in-azure-data-lake-storage-is-now-generally-available/)). In the future, we plan to leverage the lower-level APIs of major cloud object stores, to provide similar controls over file counts at streaming ingest latencies.

Review comment: hopefully `HoodieWrapperFileSystem` won't be moved to another package. Would it be better to use a permalink here for "wrapper filesystem"?
