bhasudha commented on code in PR #9406: URL: https://github.com/apache/hudi/pull/9406#discussion_r1306234076
##########
website/docs/metadata.md:
##########
@@ -3,80 +3,173 @@
 title: Metadata Table
 keywords: [ hudi, metadata, S3 file listings]
 ---
-## Motivation for a Metadata Table
+## Metadata Table
+
+Database indices contain auxiliary data structures to quickly locate the records needed, without reading unnecessary data
+from storage. Given that Hudi's design has been heavily optimized for handling mutable change streams with different
+write patterns, Hudi considers [indexing](#indexing) an integral part of its design and has uniquely supported
+[indexing capabilities](https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/) from its inception to speed
+up upserts on the Data Lakehouse. While Hudi's indices have benefited writers with fast upserts and deletes, Hudi's metadata table
+aims to tap these benefits more generally for both readers and writers. The metadata table, implemented as a single
+internal Hudi Merge-On-Read table, hosts different types of indices containing table metadata and is designed to be
+serverless and independent of compute and query engines. This is similar to common practice in databases, where metadata
+is stored as internal views.
+
+The metadata table aims to significantly improve read/write performance of queries by addressing the following key challenges:
+- **Eliminate the requirement of `list files` operations**:<br />
+  When reading and writing data, file listing operations are performed to get the current view of the file system.
+  When data sets are large, listing all the files may be a performance bottleneck, but more importantly, on cloud storage systems
+  like AWS S3, the large number of file listing requests sometimes causes throttling due to request limits.
+  The metadata table instead proactively maintains the list of files and removes the need for recursive file listing operations.
+- **Expose column stats through indices for better query planning and faster lookups by readers**:<br />
+  Query engines rely on techniques such as partition and file pruning to cut down on the amount of irrelevant data
+  scanned during query planning and execution. During the query planning phase, all data files are read for metadata on the
+  range information of columns, to further prune data files based on query predicates and the available range information. This
+  approach is expensive and does not scale if there is a large number of partitions and data files to be scanned. In
+  addition to storage optimizations such as automatic file sizing, clustering, etc., that help organize data in a
+  query-optimized way, Hudi's metadata table improves query planning further by supporting multiple types of indices that aid
+  in efficiently looking up data files based on relevant query predicates, instead of reading the column stats from every
+  individual data file and then pruning.
+
+## Supporting Multi-Modal Index in Hudi
+
+[Multi-modal indexing](https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi),
+introduced in the [0.11.0 Hudi release](https://hudi.apache.org/releases/release-0.11.0/#multi-modal-index),
+is a re-imagination of what a general-purpose indexing subsystem should look like for the lake. Multi-modal indexing is
+implemented by enhancing Hudi's metadata table with the flexibility to extend to new index types as new partitions,
+along with an [asynchronous index](https://hudi.apache.org/docs/metadata_indexing/#setup-async-indexing) building
+mechanism, and is built on the following core principles:
+- **Scalable metadata**: The table metadata, i.e., the auxiliary data about the table, must scale to extremely
+  large sizes, e.g., terabytes (TB). Different types of indices should be easily integrated to support various use cases
+  without having to worry about managing them. To realize this, all indices in Hudi's metadata table are stored as
+  partitions in a single internal MOR table. The MOR table layout enables lightning-fast writes by avoiding synchronous
+  merge of data, with reduced write amplification. This is extremely important for large datasets, as the size of updates to the
+  metadata table can otherwise grow to be unmanageable; it is what allows Hudi to scale metadata to TBs in size. The
+  foundational framework for multi-modal indexing is built to enable and disable new indices as needed. The
+  [async indexing](https://www.onehouse.ai/blog/asynchronous-indexing-using-hudi) supports index building alongside
+  regular writers without impacting write latency.
+- **ACID transactional updates**: The index and table metadata must always be up-to-date and in sync with the data table.
+  This is achieved via multi-table transactions within Hudi, which ensure atomicity of writes and resiliency to failures, so that
+  partial writes to either the data or metadata table are never exposed to other read or write transactions. The metadata
+  table is built to be self-managed, so users don't need to spend operational cycles on any table services, including
+  compaction and cleaning.
+- **Fast lookup**: The needle-in-a-haystack type of lookups must be fast and efficient without having to scan the entire
+  index, as the index size can reach TBs for large datasets. Since most accesses to the metadata table are point and range lookups,
+  the HFile format is chosen as the base file format for the internal metadata table. Since the metadata table stores
+  the auxiliary data at the partition level (files index) or the file level (column_stats index), the lookup based on a

Review Comment:
   @codope , these talk about how the multi-modal indexing subsystem is built (using the HFile format for fast lookups on the internal metadata table). This paragraph does not refer to the workload categorization where we recommend different indices. Do you agree?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
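For readers of the docs change above: the metadata table and its index partitions are toggled through Hudi write configs. A minimal sketch follows, assuming the documented config keys `hoodie.metadata.enable`, `hoodie.metadata.index.column.stats.enable`, and `hoodie.metadata.index.bloom.filter.enable`; the table name, variable names, and the commented-out Spark write call are illustrative, not part of the PR.

```python
# Illustrative sketch (not part of the PR): Hudi write options that enable
# the metadata table and its column_stats / bloom_filter index partitions.
# The config keys below are Hudi's documented write configs; "trips" and
# hudi_options are hypothetical names used only for this example.
hudi_options = {
    "hoodie.table.name": "trips",                         # hypothetical table name
    "hoodie.metadata.enable": "true",                     # maintain file listings in the metadata table
    "hoodie.metadata.index.column.stats.enable": "true",  # column_stats index for data skipping
    "hoodie.metadata.index.bloom.filter.enable": "true",  # bloom_filter index for key lookups
}

# In a real Spark job these options would be passed to a DataFrame writer, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

With these options set, the writer maintains the files and index partitions inside the internal MOR metadata table transactionally alongside each commit, which is what lets readers skip the recursive file listings and per-file column-stat reads described above.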
