This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 57c6c1161dfa docs(blog): add Hudi indexing subsystem blog part 1 of 2
(#14179)
57c6c1161dfa is described below
commit 57c6c1161dfac108f25ca1b9f4669c0f8a0410eb
Author: Shiyan Xu <[email protected]>
AuthorDate: Wed Oct 29 12:28:48 2025 -0700
docs(blog): add Hudi indexing subsystem blog part 1 of 2 (#14179)
---
...ve-into-hudis-indexing-subsystem-part-1-of-2.md | 213 +++++++++++++++++++++
.../fig1.png | Bin 0 -> 57480 bytes
.../fig2.png | Bin 0 -> 34132 bytes
.../fig3.png | Bin 0 -> 62125 bytes
.../fig4.png | Bin 0 -> 63637 bytes
.../fig5.png | Bin 0 -> 55763 bytes
.../fig6.png | Bin 0 -> 55373 bytes
7 files changed, 213 insertions(+)
diff --git
a/website/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2.md
b/website/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2.md
new file mode 100644
index 000000000000..e1cd9a03d495
--- /dev/null
+++
b/website/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2.md
@@ -0,0 +1,213 @@
+---
+title: "Deep Dive Into Hudi’s Indexing Subsystem (Part 1 of 2)"
+excerpt: ""
+author: Shiyan Xu
+category: blog
+image:
/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig1.png
+tags:
+- hudi
+- indexing
+- data lakehouse
+- data skipping
+---
+
+For decades, databases have relied on indexes—specialized data structures—to
dramatically improve read and write performance by quickly locating specific
records. Apache Hudi extends this fundamental principle to the data lakehouse
with a unique and powerful approach. Every Hudi table contains a self-managed
metadata table that functions as an indexing subsystem, enabling efficient data
skipping and fast record lookups across a wide range of read and write
scenarios.
+
+This two-part series dives into Hudi’s indexing subsystem. Part 1 explains the
internal layout and data-skipping capabilities. Part 2 covers advanced
features—record, secondary, and expression indexes—and asynchronous index
maintenance. By the end, you’ll know how to leverage Hudi’s multimodal index to
build more efficient lakehouse tables.
+
+## The Metadata Table
+
+Within a Hudi table (the data table), the metadata table itself is a Hudi
Merge-on-Read (MOR) table. Unlike a typical data table, it features a
specialized layout. The table is physically partitioned by index type, with
each partition containing the relevant index entries. For its physical storage,
the metadata table uses HFile as the base file format. This choice is
deliberate: HFile is exceptionally efficient at handling key lookups—the
predominant query pattern for indexing. Let’s exp [...]
+
+### Multimodal indexing
+
+The metadata table is often referred to as a multimodal index because it
houses a diverse range of index types, providing versatile capabilities to
accelerate various query patterns. The following diagram illustrates the layout
of the metadata table and its relationship with the main data table.
+
+
+
+The metadata table is located in the `.hoodie/metadata/` directory under the
data table’s base path. It contains partitions for different indexes, such as
the files index (under the `files/` partition) for tracking the data table’s
partitions and files, and the column stats index (under the `column_stats/`
partition) for tracking file-level statistics (e.g., min/max values) for
specific columns. Each index partition stores mapping entries tailored to its
specific purpose.
+
+This partitioned design provides great flexibility, allowing you to enable
only the indexes that suit your workload. It also ensures extensibility, making
it straightforward to support new index types in the future. For example, the
[bitmap index](https://github.com/apache/hudi/blob/master/rfc/rfc-92/rfc-92.md)
and the vector search index are on the
[roadmap](https://hudi.apache.org/roadmap) and will be maintained in their own
dedicated partitions.
+
+When committing to a data table, the metadata table is updated within the same
transactional write. This crucial step ensures that index entries are always
synchronized with data table records, upholding data integrity across the
table. Therefore, choosing Merge-on-Read (MOR) as the table type for the
metadata table is an obvious choice. MOR offers the advantage of absorbing
high-frequency write operations, preventing the metadata table’s update process
from becoming a bottleneck for ove [...]
+
+### HFile format
+
+The HFile format stores key-value pairs in a sorted, immutable, and
block-indexed way, modeled after Google’s SSTable introduced by the [Bigtable
paper](https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf).
Here is the description of SSTable quoted from the paper:
+
+> An SSTable provides a persistent, ordered immutable map from keys to values,
where both keys and values are arbitrary byte strings. Operations are provided
to look up the value associated with a specified key, and to iterate over all
key/value pairs in a specified key range. Internally, each SSTable contains a
sequence of blocks (typically each block is 64KB in size, but this is
configurable). A block index (stored at the end of the SSTable) is used to
locate blocks; the index is loade [...]
+
+As you can tell, by implementing the SSTable, HFile is especially efficient at
performing random access, which is the primary query pattern for indexing—given
a specific piece of information, like a record key or a partition value, return
matching results, such as the file ID that contains the record key, or the list
of files that belong to the partition.
+
+
+
+Because the keys in an HFile are stored in lexicographic order, a batched
lookup with a common key prefix is also highly efficient, requiring only a
sequential read of nearby keys.
+
+### Default behaviors
+
+When a Hudi table is created, the metadata table will be enabled with three
partitions by default: *files*, *column stats*, and *partition stats*:
+
+* **Files**: stores the list of all partitions and the lists of all base files
and log files of each partition, located at the `files/` partition of the
metadata table.
+* **Column stats**: stores file-level statistics like min, max, value count,
and null count for specified columns, located at the `column_stats/` partition
of the metadata table.
+* **Partition stats**: stores partition-level statistics like min, max, value
count, and null count for specified columns, located at the `partition_stats/`
partition of the metadata table.
+
+By default, when no column is specified for column\_stats and
partition\_stats, Hudi will index the first 32 columns (controlled by
`hoodie.metadata.index.column.stats.max.columns.to.index`) available in the
table schema.
+
+Whenever a new write is performed on the data table, the metadata table will
be updated accordingly. For any available index, new index entries will be
upserted to its corresponding partition. For example, if the new write creates
a new partition in the data table with some new base files, the files partition
will be updated and contain the latest partition and file lists. Similarly, the
column stats and partition stats partitions will receive new entries indicating
the updated statistic [...]
+
+Note that by design, you cannot disable the files partition, as it is a
fundamental index that serves both read and write processes. You can still,
although not recommended, disable the entire metadata table by setting
`hoodie.metadata.enable=false` during a write.
+
+We will discuss more details about how the default indexes work to improve
read and write performance. We will also introduce more indexes supported by
the metadata table with usage examples in the following sections.
+
+## Data Skipping with Files, Column Stats, and Partition Stats
+
+Data skipping is a core optimization technique that avoids unnecessary data
scanning. Its most basic form is physical partitioning, where data is organized
into directories based on columns like `order_date` in a customer order table.
When a query filters on a partitioned column, the engine uses *partition
pruning* to read only the relevant directories. More advanced techniques store
lightweight statistics—such as min/max values—for data within each file. The
query engine consults this m [...]
+
+### The data skipping process
+
+Hudi’s indexing subsystem implements a multi-level skipping strategy using a
combination of indexes. Query engines like Spark or Trino can leverage Hudi’s
files, partition stats, and column stats indexes to improve performance
dramatically. The process, illustrated in the figure below, unfolds in several
stages.
+
+
+
+First, the query engine parses the input SQL and extracts relevant filter
predicates, such as `price >= 300`. These predicates are pushed down to Hudi’s
integration component, which manages the index lookup process.
+
+The component then consults the files index to get an initial list of
partitions. It prunes this list using the partition stats index, which holds
partition-level statistics like min/max values. For example, any partition with
a maximum price below 300 is skipped entirely.
+
+After this initial pruning, the component consults the files index again to
retrieve the list of data files within the remaining partitions. This file list
is pruned further using the column stats index, which provides the same min/max
statistics at the file level.
+
+This multi-step process ensures that the query engine reads only the minimum
set of files required to satisfy the query, significantly reducing the total
amount of data processed.
+
+### SQL examples
+
+The following examples demonstrate data skipping in action. We will create a
Hudi table and execute Spark SQL queries against it, starting with both
partition and column stats disabled to establish a baseline.
+
+```sql
+CREATE TABLE orders (
+ order_id STRING,
+ price DECIMAL(12,2),
+ order_status STRING,
+ update_ts BIGINT,
+ shipping_date DATE,
+ shipping_country STRING
+) USING HUDI
+PARTITIONED BY (shipping_country)
+OPTIONS (
+ primaryKey = 'order_id',
+ preCombineField = 'update_ts',
+ hoodie.metadata.index.column.stats.enable = 'false',
+ hoodie.metadata.index.partition.stats.enable = 'false'
+);
+```
+
+And insert some sample data:
+
+```sql
+INSERT INTO orders VALUES
+('ORD001', 389.99, 'PENDING', 17495166353, DATE '2023-01-01', 'A'),
+('ORD002', 199.99, 'CONFIRMED', 17495167353, DATE '2023-01-01', 'A'),
+('ORD003', 59.50, 'SHIPPED', 17495168353, DATE '2023-01-11', 'B'),
+('ORD004', 99.00, 'PENDING', 17495169353, DATE '2023-02-09', 'B'),
+('ORD005', 19.99, 'PENDING', 17495170353, DATE '2023-06-12', 'C'),
+('ORD006', 5.99, 'SHIPPED', 17495171353, DATE '2023-07-31', 'C');
+```
+
+#### Only the files index
+
+With both column stats and partition stats disabled, only the files index is
built during the insert operation. We’ll use the SQL below for our test:
+
+```sql
+SELECT order_id, price, shipping_country
+FROM orders
+WHERE price > 300;
+```
+
+This query looks for orders with price greater than 300, which only exist in
partition 'A' (shipping\_country = 'A'). After running the SQL, here's what we
see in the Spark UI:
+
+
+
+Spark read all 3 partitions and 3 files to find potential matches, but only 1
record from partition A actually satisfied the query condition.
+
+#### Enabling column stats
+
+Now let's enable column stats while keeping partition stats disabled. Note
that we can't do it the other way around—partition stats requires column stats
to be enabled first.
+
+```sql
+CREATE TABLE orders (
+ order_id STRING,
+ price DECIMAL(12,2),
+ order_status STRING,
+ update_ts BIGINT,
+ shipping_date DATE,
+ shipping_country STRING
+) USING HUDI
+PARTITIONED BY (shipping_country)
+OPTIONS (
+ primaryKey = 'order_id',
+ preCombineField = 'update_ts',
+ hoodie.metadata.index.column.stats.enable = 'true',
+ hoodie.metadata.index.partition.stats.enable = 'false'
+);
+```
+
+Running the same SQL gives us this in the Spark UI:
+
+
+
+Now it shows all 3 partitions but only 1 file was scanned. Without partition
stats, the query engine couldn't prune partitions, but column stats
successfully filtered out the non-matching files. The compute cost of examining
those 2 irrelevant partitions and their files could have been avoided with
partition stats enabled.
+
+#### Enabling column stats and partition stats
+
+Now let's enable partition stats as well. Since both indexes are enabled by
default in Hudi 1.x, we can simply omit those additional configs from the
CREATE statement:
+
+```sql
+CREATE TABLE orders (
+ order_id STRING,
+ price DECIMAL(12,2),
+ order_status STRING,
+ update_ts BIGINT,
+ shipping_date DATE,
+ shipping_country STRING
+) USING HUDI
+PARTITIONED BY (shipping_country)
+OPTIONS (
+ primaryKey = 'order_id',
+ preCombineField = 'update_ts'
+);
+```
+
+Running the same SQL gives us this in the Spark UI:
+
+
+
+Now we see the full pruning effect happened—only 1 relevant partition and 1
relevant file were scanned, thanks to both indexes working together. [This
blog](https://hudi.apache.org/blog/2025/10/22/Partition_Stats_Enhancing_Column_Stats_in_Hudi_1.0/)
shows a 93% reduction in query time running on a 1 TB dataset.
+
+### Configure relevant columns to be indexed
+
+By default, Hudi indexes the first 32 columns for both partition stats and
column stats. This limit prevents excessive metadata overhead—each indexed
column requires computing min, max, null-count, and value-count statistics for
every partition and data file. In most cases, you only need to index a small
subset of columns that are frequently used in query predicates. You can specify
which columns to be indexed to reduce the maintenance costs:
+
+```sql
+CREATE TABLE orders (
+ order_id STRING,
+ price DECIMAL(12,2),
+ order_status STRING,
+ update_ts BIGINT,
+ shipping_date DATE,
+ shipping_country STRING
+) USING HUDI
+PARTITIONED BY (shipping_country)
+OPTIONS (
+ primaryKey = 'order_id',
+ preCombineField = 'update_ts',
+ 'hoodie.metadata.index.column.stats.column.list' = 'price,shipping_date'
+);
+```
+
+The config `hoodie.metadata.index.column.stats.column.list` applies to both
partition stats and column stats. By indexing just the `price` and
`shipping_date` columns, queries filtering on price comparisons or shipping
date ranges will already see significant performance improvements.
+
+## Key Takeaways and What's Next
+
+Hudi’s metadata table is itself a Hudi Merge‑on‑Read (MOR) table that acts as
a multimodal indexing subsystem. It is physically partitioned by index type
(for example, `files/`, `column_stats/`, `partition_stats/`) and stores base
files in the HFile (SSTable‑like) format. This layout provides fast point
lookups and efficient batched scans by key prefix—exactly the access patterns
indexing needs at lakehouse scale.
+
+Index maintenance happens transactionally alongside data writes, keeping index
entries consistent with the data table. Periodic compaction merges log files
into read‑optimized HFile base files to keep point lookups fast and
predictable. On the read path, Hudi composes multiple indexes to minimize I/O:
the files index enumerates candidates, partition stats prune irrelevant
partitions, and column stats prune non‑matching files. In effect, the engine
scans only the minimum set of files requ [...]
+
+In practice, the defaults are a strong starting point. Keep the metadata table
enabled and explicitly list only the columns you frequently filter on via
`hoodie.metadata.index.column.stats.column.list` to control metadata overhead.
In Part 2, we’ll go deeper into accelerating equality‑matching and
expression‑based predicates using the record, secondary, and expression
indexes, and discuss how asynchronous index maintenance keeps writers unblocked
while indexes build in the background.
diff --git
a/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig1.png
b/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig1.png
new file mode 100644
index 000000000000..9c0da0c1802d
Binary files /dev/null and
b/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig1.png
differ
diff --git
a/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig2.png
b/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig2.png
new file mode 100644
index 000000000000..3ed598e50102
Binary files /dev/null and
b/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig2.png
differ
diff --git
a/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig3.png
b/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig3.png
new file mode 100644
index 000000000000..ffd98ec3b524
Binary files /dev/null and
b/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig3.png
differ
diff --git
a/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig4.png
b/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig4.png
new file mode 100644
index 000000000000..7a170b1db5ec
Binary files /dev/null and
b/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig4.png
differ
diff --git
a/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig5.png
b/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig5.png
new file mode 100644
index 000000000000..32a99550b488
Binary files /dev/null and
b/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig5.png
differ
diff --git
a/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig6.png
b/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig6.png
new file mode 100644
index 000000000000..f34cf0f8d996
Binary files /dev/null and
b/website/static/assets/images/blog/2025-10-29-deep-dive-into-hudis-indexing-subsystem-part-1-of-2/fig6.png
differ