This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 99e573a9f39 [HUDI-7021] Add blog to introduce record level index
(#9970)
99e573a9f39 is described below
commit 99e573a9f399e868073b45b6b1208060ea0cfc49
Author: Sivabalan Narayanan <[email protected]>
AuthorDate: Fri Nov 3 21:58:35 2023 -0400
[HUDI-7021] Add blog to introduce record level index (#9970)
---------
Co-authored-by: Shiyan Xu <[email protected]>
---
website/blog/2023-11-01-record-level-index.md | 236 +++++++++++++++++++++
.../record-level-index/01.metadatatable_layout.png | Bin 0 -> 55537 bytes
.../blog/record-level-index/02.RLI_init_flow.png | Bin 0 -> 25705 bytes
.../blog/record-level-index/03.RLI_bulkinsert.png | Bin 0 -> 91719 bytes
.../blog/record-level-index/04.RLI_tagging.png | Bin 0 -> 147390 bytes
.../blog/record-level-index/index-latency.png | Bin 0 -> 80119 bytes
.../blog/record-level-index/write-latency.png | Bin 0 -> 74011 bytes
7 files changed, 236 insertions(+)
diff --git a/website/blog/2023-11-01-record-level-index.md
b/website/blog/2023-11-01-record-level-index.md
new file mode 100644
index 00000000000..ed0e84ad884
--- /dev/null
+++ b/website/blog/2023-11-01-record-level-index.md
@@ -0,0 +1,236 @@
+---
+title: "Record Level Index: Hudi's blazing fast indexing for large-scale
datasets"
+excerpt: "Announcing the Record Level Index in Apache Hudi"
+author: Shiyan Xu and Sivabalan Narayanan
+category: blog
+image: /assets/images/blog/record-level-index/03.RLI_bulkinsert.png
+tags:
+- design
+- index
+- speedup
+- metadata
+- improve write latency
+- improve read latency
+- apache hudi
+---
+
+## Introduction
+
+Index is a critical component that facilitates quick updates and deletes for
Hudi writers, and it plays a pivotal
+role in boosting query executions as well. Hudi provides several index types,
including the Bloom and Simple indexes with global
+variations, the HBase Index that leverages a HBase server, the hash-based
Bucket index, and the multi-modal index
+realized through the metadata table. The choice of an index depends on factors
such as table sizes, partition data distributions,
+or traffic patterns, where a specific index may be more suitable for simpler
operation or better performance[^1].
+Users often face trade-offs when selecting index types for different tables,
since there hasn't been
+a generally performant index capable of facilitating both writes and reads
with minimal operational overhead.
+
+Starting from [Hudi 0.14.0](https://hudi.apache.org/releases/release-0.14.0),
we are thrilled to announce a
+general purpose index for Apache Hudi - the Record Level Index (RLI). This
innovation not only dramatically boosts
+write efficiency but also improves read efficiency for relevant queries.
Integrated seamlessly within the table storage layer,
+RLI can easily work without any additional operational efforts.
+
+In the subsequent sections of this blog, we will give a brief introduction to
Hudi's metadata table, a pre-requisite for discussing RLI.
+Following that, we will delve into the design and workflows of RLI, and then
show performance analysis and index type comparisons. The blog
+will conclude with insights into future work for RLI.
+
+## Metadata table
+
+A [Hudi metadata table](https://hudi.apache.org/docs/metadata) is a
Merge-on-Read (MoR) table within the `.hoodie/metadata/` directory. It contains
various
+metadata pertaining to records, seamlessly integrated into both the writer and
reader paths to improve indexing efficiency.
+The metadata is segregated into four partitions: `files`, `column stats`,
`bloom filters`, and `record level index`.
+
+<img src="/assets/images/blog/record-level-index/01.metadatatable_layout.png"
alt="Hudi metadata table layout" width="800" align="middle"/>
+
+The metadata table is updated synchronously with each commit action on the
Timeline, in other words, the commits to the
+metadata table are part of the transactions to the Hudi data table. With four
partitions containing different types of
+metadata, this layout serves the purpose of a multi-modal index:
+
+- `files` partition keeps track of the Hudi data table’s partitions, and data
files of each partition
+- `column stats` partition records statistics about each column of the data
table
+- `bloom filter` partition stores serialized bloom filters for base files
+- `record level index` partition contains mappings of individual record key
and the corresponding file group id
+
+Users can activate the metadata table by setting
`hoodie.metadata.enable=true`. Once activated, the `files` partition
+will always be enabled. Other partitions can be enabled and configured
individually to harness additional indexing
+capabilities.
+
+## Record Level Index
+
+Starting from release 0.14.0, the Record Level Index (RLI) can be activated by
setting `hoodie.metadata.record.index.enable=true`
+and `hoodie.index.type=RECORD_INDEX`. The core concept behind RLI is the
ability to determine the location of records, thus
+reducing the number of files that need to be scanned to extract the desired
data. This process is usually referred to as "index look-up".
+Hudi employs a primary-key model, requiring each record to be associated with
a key
+to satisfy the uniqueness constraint. Consequently, we can establish
one-to-one mappings between record keys and file groups,
+precisely the data we intend to store within the `record level index`
partition.
+
+Performance is paramount when it comes to indexes. The metadata table, which
includes the RLI partition, chooses
[HFile](https://hbase.apache.org/book.html#_hfile_format_2)[^2],
+HBase’s file format that utilizes B+ tree-like structures for fast look-up, as
the file format. Real-world benchmarking
+has shown that an HFile containing 1 million RLI mappings can look up a batch
of 100k records in just 600 ms.
+We will cover the performance topic in a later section with detailed analysis.
+
+### Initialization
+
+Initializing the RLI partition for an existing Hudi table can be a laborious
and time-consuming task, contingent on the number
+of records. Just like with a typical database, building indexes takes time,
but the investment ultimately pays off by speeding up
+numerous queries in the future.
+
+<img src="/assets/images/blog/record-level-index/02.RLI_init_flow.png"
alt="RLI init flow" width="800" align="middle"/>
+
+The diagram above shows the high-level steps of RLI initialization. Since
these jobs are all parallelizable, users can
+scale the cluster and configure relevant parallelism settings (e.g.,
`hoodie.metadata.max.init.parallelism`) accordingly
+to meet their time requirement.
+
+Focusing on the final step, "Bulk insert to RLI partition," the metadata table
writer employs a hash function to
+partition the RLI records, ensuring that the number of resulting file groups
aligns with the number of partitions.
+This guarantees consistent record key look-ups.
+
+<img src="/assets/images/blog/record-level-index/03.RLI_bulkinsert.png"
alt="RLI bulkinsert" width="800" align="middle"/>
+
+It’s important to note that the current implementation fixes the number of
file groups in the RLI partition once it’s initialized.
+Therefore, users should lean towards over-provisioning the file groups and
adjust these configurations accordingly.
+
+```
+hoodie.metadata.record.index.max.filegroup.count
+hoodie.metadata.record.index.min.filegroup.count
+hoodie.metadata.record.index.max.filegroup.size
+hoodie.metadata.record.index.growth.factor
+```
+
+In future development iterations, RLI should be able to overcome this
limitation by dynamically rebalancing file groups to
+accommodate the ever-increasing number of records.
+
+### Updating RLI upon data table writes
+
+During regular writes, the RLI partition will be updated as part of the
transactions. Metadata records will be generated
+using the incoming record keys with their corresponding location info. Given
that the RLI partition contains the exact
+mappings of record keys and locations, upserts to the data table will result
in upsertion of the corresponding keys to the
+RLI partition, The hash function employed will guarantee that identical keys
are routed to the same file group.
+
+### Writer Indexing
+
+Being part of the write flow, RLI follows the high-level indexing flow,
similar to any other global index: for a given
+set of records, it tags each record with location information if the index
finds them present in any existing file group.
+The key distinction lies in the source of truth for the existence test—the RLI
partition. The diagram below illustrates
+the tagging flow with detailed steps.
+
+<img src="/assets/images/blog/record-level-index/04.RLI_tagging.png" alt="RLI
tagging" width="800" align="middle"/>
+
+The tagged records will be passed to Hudi write handles and will undergo write
operations to their respective file groups.
+The indexing process is a critical step in applying updates to the table, as
its efficiency directly influences the write
+latency. In a later section, we will demonstrate the Record Level Index
performance using benchmarking results.
+
+### Read Flow
+
+The Record Level Index is also integrated on the query side[^3]. In queries
that involve equality check (e.g., EqualTo or IN)
+against the record key column, Hudi’s file index implementation optimizes the
file pruning process. This optimization is
+achieved by leveraging RLI to precisely locate the file groups that need to be
read for completing the queries.
+
+### Storage
+
+Storage efficiency is another vital aspect of the design. Each RLI mapping
entry must include some necessary information
+to precisely locate files, such as record key, partition path, file group id,
etc. To optimize the storage, RLI adopts
+some compression techniques such as encoding file group id (in the form of
UUID) into 2 Longs to represent the high and
+low bits. Using Gzip compression and a 4MB block size, an individual RLI
record averages only 48 bytes in size. To
+illustrate this more practically, let’s assume we have a table of 100TB data
with about 1 billion records (average record size = 100Kb).
+The storage space required by the RLI partition will be approximately 48 Gb,
which is less than 0.05% of the total data size.
+Since RLI contains the same number of entries as the data table, storage
optimization is crucial to make RLI practical,
+especially for tables of petabyte size and beyond.
+
+RLI exploits the low cost of storage to enable the rapid look-up process
similar to the HBase index, while avoiding the
+operational overhead of running an extra server. In the next section, we will
review some benchmarking results to demonstrate
+its performance advantages.
+
+### Performance
+
+We conducted a comprehensive benchmarking analysis of the Record Level Index
evaluating aspects such write latency,
+index look-up latency, and data shuffling in comparison to existing indexing
mechanisms in Hudi. In addition to the
+benchmarks for write operations, we will also showcase the reduction in query
latencies for point look-ups. Hudi 0.14.0
+and Spark 3.2.1 were used throughout the experiments.
+
+In comparison to the Global Simple Index (GSI) in Hudi, Record Level Index
(RLI) is crafted for significant performance
+advantages stemming from a greatly reduced scan space and minimized data
shuffling. GSI conducts join operations between
+incoming records and existing data across all partitions of the data table,
resulting in substantial data shuffling and
+computational overhead to pinpoint the records. On the other hand, RLI
efficiently extracts location info through a
+hash function, leading to a considerably smaller amount of data shuffling by
only loading the file groups of interest
+from the metadata table.
+
+#### Write latency
+
+In the first set of experiments, we established two pipelines: one configured
using GSI, and the other configured with RLI.
+Each pipeline was executed on an EMR cluster of 10 m5.4xlarge core instances,
and was set to ingest batches of 200Mb data
+into a 1TB dataset of 2 billion records. The RLI partition was configured with
1000 file groups. For N batches of ingestion,
+**the average write latency using RLI showed a remarkable 72% improvement over
GSI**.
+
+<img src="/assets/images/blog/record-level-index/write-latency.png"
alt="metadata-rli" width="600" align="middle"/>
+
+Note: Between Global Simple Index and Global Bloom Index in Hudi, the former
yielded better results due to the randomness
+of record keys. Therefore, we omitted the presentation of the Global Bloom
Index in the chart.
+
+#### Index look-up latency
+
+We also isolated the index look-up step using HoodieReadClient to accurately
gauge indexing efficiency. Through
+experiments involving the look-up of 400,000 records (0.02%) in a 1TB dataset
of 2 billion records, **RLI showcased a
+72% improvement over GSI, consistent with the end-to-end write latency
results**.
+
+<img src="/assets/images/blog/record-level-index/write-latency.png"
alt="index-latency" width="600" align="middle"/>
+
+#### Data shuffling
+
+In the index look-up experiments, we observed that around 85Gb of data was
shuffled for GSI, whereas only 700Mb was shuffled
+for RLI. **This reflects an impressive 92% reduction in data shuffling when
using RLI compared to GSI**.
+
+#### Query latency
+
+The Record Level Index will greatly boost Spark queries with “EqualTo” and
“IN” predicates on record key columns.
+We created a 400GB Hudi table comprising 20,000 file groups. When we executed
a query predicated on a single record key,
+we observed a significant improvement in query time. **With RLI enabled, the
query time decreased from 977 seconds to just
+12 seconds, representing an impressive 98% reduction in latency**[^4].
+
+### When to Use
+
+RLI demonstrates outstanding performance in general, elevating update and
delete efficiency to a new level and
+fast-tracking reads when executing key-matching queries. Enabling RLI is also
as simple as setting some configuration flags.
+Below, we have summarized a comparison table highlighting these important
characteristics of RLI in contrast to other common Hudi index types.
+
+| | Record Level Index | Global Simple Index |
Global Bloom Index | HBase Index | Bucket Index |
+|-------------------------------|--------------------|---------------------|--------------------|--------------------------------------|----------------|
+| Performant look-up in general | Yes | No |
No | Yes, with possible throttling issues | Yes |
+| Boost both writes and reads | Yes | No, write-only |
No, write-only | No, write-only | No, write-only |
+| Easy to enable | Yes | Yes |
Yes | No, require HBase server | Yes |
+
+Many real-world applications will significantly benefit from using RLI. A
common example is fulfilling the GDPR requirements.
+Typically, when users make requests, a set of IDs will be provided to identify
the to-be-deleted records,
+which will either be updated (columns being nullified) or permanently removed.
+By enabling RLI, offline jobs performing such changes will become notably more
efficient, resulting in cost savings.
+On the read side, analysts or engineers collecting historical events through
certain tracing IDs will also
+experience blazing fast responses from the key-matching queries.
+
+While RLI holds the above-mentioned advantages over all other index types, it
is important to consider certain
+aspects when using it. Similar to any other global index, RLI requires
record-key uniqueness across all partitions in a table.
+As RLI keeps track of all record keys and locations, the initialization
process may take time for large tables.
+In scenarios with extremely skewed large workloads, RLI might not achieve the
desired performance due to limitations in the current design.
+
+## Future Work
+
+In this initial version of the Record Level Index, certain limitations are
acknowledged. As mentioned in the
+"Initialization" section, the number of file groups must be predetermined
during the creation of the RLI partition.
+Hudi does use some heuristics and a growth factor for an existing table, but
for a new table, it is recommended to
+set appropriate file group configs for RLI. As the data volume increases, the
RLI partition requires re-bootstrapping
+when additional file groups are needed for scaling out. To address the need
for rebalancing, a consistent hashing
+technique could be employed.
+
+Another valuable enhancement would involve supporting the indexing of
secondary columns alongside the record key
+fields, thus catering to a broader range of queries. On the reader side, there
is a plan to integrate more query
+engines, such as Presto and Trino, with the Record Level Index to fully
leverage the performance benefits offered
+by Hudi metadata tables.
+
+
+---
+
+[^1] [This
blog](https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/)
well-explained some best practices regarding index selection and configuration.
+
+[^2] Other formats like Parquet can also be supported in the future.
+
+[^3] As of now, query engine integration is only available for Spark, with
plans to support additional engines in the future.
+
+[^4] The query improvement is specific to record-key-matching queries and does
not reflect a general reduction in latency by enabling RLI. In the case of the
single record-key query, 99.995% of file groups (19999 out of 20000) were
pruned during query execution.
diff --git
a/website/static/assets/images/blog/record-level-index/01.metadatatable_layout.png
b/website/static/assets/images/blog/record-level-index/01.metadatatable_layout.png
new file mode 100644
index 00000000000..32e6f5c2274
Binary files /dev/null and
b/website/static/assets/images/blog/record-level-index/01.metadatatable_layout.png
differ
diff --git
a/website/static/assets/images/blog/record-level-index/02.RLI_init_flow.png
b/website/static/assets/images/blog/record-level-index/02.RLI_init_flow.png
new file mode 100644
index 00000000000..03efc9b2c0a
Binary files /dev/null and
b/website/static/assets/images/blog/record-level-index/02.RLI_init_flow.png
differ
diff --git
a/website/static/assets/images/blog/record-level-index/03.RLI_bulkinsert.png
b/website/static/assets/images/blog/record-level-index/03.RLI_bulkinsert.png
new file mode 100644
index 00000000000..313d1fa7208
Binary files /dev/null and
b/website/static/assets/images/blog/record-level-index/03.RLI_bulkinsert.png
differ
diff --git
a/website/static/assets/images/blog/record-level-index/04.RLI_tagging.png
b/website/static/assets/images/blog/record-level-index/04.RLI_tagging.png
new file mode 100644
index 00000000000..05dffcea4aa
Binary files /dev/null and
b/website/static/assets/images/blog/record-level-index/04.RLI_tagging.png differ
diff --git
a/website/static/assets/images/blog/record-level-index/index-latency.png
b/website/static/assets/images/blog/record-level-index/index-latency.png
new file mode 100644
index 00000000000..0ddc71f730d
Binary files /dev/null and
b/website/static/assets/images/blog/record-level-index/index-latency.png differ
diff --git
a/website/static/assets/images/blog/record-level-index/write-latency.png
b/website/static/assets/images/blog/record-level-index/write-latency.png
new file mode 100644
index 00000000000..aede4ef1c37
Binary files /dev/null and
b/website/static/assets/images/blog/record-level-index/write-latency.png differ