xushiyan commented on code in PR #10624:
URL: https://github.com/apache/hudi/pull/10624#discussion_r1478869049
########## website/docs/hudi_stack.md: ##########
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains the various layers of software components that make up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database kernel. It brings core warehouse and database functionality directly to a data lake, thereby providing a table-level abstraction over open file formats like Apache Parquet/ORC (more recently known as the lakehouse architecture) and enabling transactional capabilities such as updates and deletes. Hudi also incorporates essential table services that are tightly integrated with the database kernel. These services can be executed automatically across both ingested and derived data to manage various aspects such as table bookkeeping, metadata, and storage layout. This integration, along with various platform-specific services, extends Hudi's role from just a 'table format' to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of software components that constitute Hudi. The features marked with an asterisk (*) represent work in progress, and the dotted boxes indicate planned future work. These components collectively aim to fulfill the [vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for the project.
+
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi interacts with the storage layer through the [Hadoop FileSystem API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), enabling compatibility with a wide range of systems: HDFS for fast appends, and cloud stores such as Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage.
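The "one storage interface, many backends" idea described above can be pictured with a small Python sketch. This is a hypothetical stand-in for the Hadoop FileSystem API that Hudi programs against, not Hudi's actual storage API; all class and method names below are invented for illustration:

```python
from abc import ABC, abstractmethod


class LakeStorage(ABC):
    """Hypothetical minimal storage interface, standing in for the
    Hadoop FileSystem API that Hudi programs against."""

    @abstractmethod
    def list_files(self, prefix: str) -> list[str]: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def append(self, path: str, data: bytes) -> None: ...


class InMemoryStorage(LakeStorage):
    """Toy backend; a real deployment would plug in HDFS, S3, GCS,
    or Azure Blob Storage behind the same interface."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def list_files(self, prefix: str) -> list[str]:
        return sorted(p for p in self._files if p.startswith(prefix))

    def read(self, path: str) -> bytes:
        return self._files[path]

    def append(self, path: str, data: bytes) -> None:
        # Appends matter: Hudi's log files are built by appending blocks.
        self._files[path] = self._files.get(path, b"") + data


storage: LakeStorage = InMemoryStorage()
storage.append("tbl/part=a/base_001.parquet", b"rows")
storage.append("tbl/part=a/.log_001", b"delta")
print(storage.list_files("tbl/part=a/"))
```

Because callers depend only on the abstract interface, swapping the backend never touches table logic — the same decoupling that lets Hudi run over HDFS or any cloud store.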
Additionally, Hudi offers its own storage APIs that can rely on Hadoop-independent file system implementations to simplify the integration of various file systems. Hudi adds a custom wrapper filesystem that lays the foundation for improved storage optimizations.
+
+# File Formats
+
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. Hudi operates on a 'base file and log file' structure. The base files are compacted and optimized for reads, and are augmented with log files for efficient appends. Future updates aim to integrate diverse formats, such as unstructured data (e.g., JSON, images), and compatibility with different storage layers in event-streaming systems, OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a log file as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads.
+
+# Transactional Database Layer
+The transactional database layer of Hudi comprises the core components responsible for the fundamental operations and services that enable Hudi to store, retrieve, and manage data efficiently on data lakehouse storage.
+
+## Table Format
+
+_Figure: Apache Hudi's table format_
+
+Drawing an analogy to file formats, a table format simply comprises the file layout of the table, the schema, and metadata tracking changes to the table. Hudi organizes files within a table or partition into File Groups. Updates are captured in log files tied to these File Groups, ensuring efficient merges. There are three major components related to Hudi's table format.
+
+- **Timeline**: Hudi's [timeline](https://hudi.apache.org/docs/timeline), stored in the `/.hoodie` folder, is a crucial event log recording all table actions in an ordered manner, with events kept for a specified period.
Hudi uniquely designs each file group as a self-contained log, enabling record state reconstruction through delta logs even after related actions have been archived. This approach effectively bounds metadata size by table activity frequency, which is essential for managing tables with frequent updates.
+
+- **File Group and File Slice**: Within each partition, the data is physically stored as base and log files, organized into the logical concepts of [File Groups](https://hudi.apache.org/tech-specs-1point0/#storage-layout) and File Slices. A File Group is split into multiple File Slices, each representing one version of its records; a File Slice comprises a base file and its associated log files. Each File Slice within a File Group is uniquely identified by the timestamp of the commit that created it.
+
+- **Metadata Table**: Implemented as a merge-on-read table, Hudi's [metadata table](https://hudi.apache.org/docs/next/metadata) efficiently handles quick updates with low write amplification. It leverages the HFile format for quick, indexed key lookups, storing vital information like file paths, column statistics, bloom filters, and record indexes. This approach streamlines operations by reducing the need for expensive cloud file listings. The metadata table in Hudi acts as an additional [indexing system](https://hudi.apache.org/docs/metadata#supporting-multi-modal-index-in-hudi) to uplevel both read and write performance.
+
+Hudi's approach of recording updates into log files is more efficient and incurs lower merge overhead than systems like Hive ACID, which must merge all delta records against all base files. Read more about the various table types in Hudi [here](https://hudi.apache.org/docs/table_types).
+
+## Indexes
+
+_Figure: Indexes in Hudi_
+
+[Indexes](https://hudi.apache.org/docs/indexing) in Hudi enhance query planning, minimizing I/O, speeding up response times, and providing faster writes with low merge costs.
Hudi’s [metadata table](https://hudi.apache.org/docs/next/metadata/#metadata-table-indices) brings the benefits of indexes generally to both readers and writers. Compute engines can leverage various indexes in the metadata table, like file listings, column statistics, bloom filters, and record-level indexes, to quickly generate optimized query plans and improve read performance. In addition to metadata table indexes, Hudi supports Simple, Bloom, HBase, and Bucket indexes to efficiently locate the file group containing a specific record key. Hudi also provides reader indexes such as [functional](https://github.com/apache/hudi/blob/master/rfc/rfc-63/rfc-63.md) and secondary indexes to boost reads. The table partitioning scheme in Hudi is consciously exploited for implementing global and non-global indexing strategies.

Review Comment:
```suggestion
[Indexes](https://hudi.apache.org/docs/indexing) in Hudi enhance query planning, minimizing I/O, speeding up response times and providing faster writes with low merge costs. Hudi’s [metadata table](https://hudi.apache.org/docs/next/metadata/#metadata-table-indices) brings the benefits of indexes generally to both the readers and writers. Compute engines can leverage various indexes in the metadata table, like file listings, column statistics, bloom filters, record-level indexes, and [functional indexes](https://github.com/apache/hudi/blob/master/rfc/rfc-63/rfc-63.md) to quickly generate optimized query plans and improve read performance. In addition to the metadata table indexes, Hudi supports Simple, Bloom, HBase, and Bucket indexes, to efficiently locate File Groups containing specific record keys. The table partitioning scheme in Hudi is consciously exploited for implementing global and non-global indexing strategies.
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
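The 'base file plus log file' merge that the quoted doc describes — replaying data and delete blocks over a base file to reconstruct record state — can be sketched as a toy Python model. The function name and dict-based records are invented for illustration; real Hudi readers operate on Parquet/Avro blocks, and rollback blocks are omitted for brevity:

```python
def merge_file_slice(base: dict[str, dict], log_blocks: list[tuple[str, object]]) -> dict[str, dict]:
    """Replay log blocks (in commit order) over base-file records.

    Toy model of a merge-on-read scan: 'data' blocks upsert records
    keyed by record key, 'delete' blocks remove them.
    """
    state = dict(base)
    for kind, payload in log_blocks:
        if kind == "data":        # upserts: {record_key: new_row}
            state.update(payload)
        elif kind == "delete":    # deletes: [record_key, ...]
            for key in payload:
                state.pop(key, None)
    return state


base = {"r1": {"fare": 10}, "r2": {"fare": 20}}
log = [
    ("data", {"r2": {"fare": 25}, "r3": {"fare": 30}}),  # later commit: update + insert
    ("delete", ["r1"]),                                  # later commit: delete r1
]
print(merge_file_slice(base, log))  # {'r2': {'fare': 25}, 'r3': {'fare': 30}}
```

This also illustrates why appends are cheap for writers (each commit just adds blocks to the log) while compaction periodically folds the log back into a new base file so readers do less replay work.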
