[hudi] branch asf-site updated: [DOCS] Update Metadata table and metadata indexing related pages (#9406)

codope Mon, 28 Aug 2023 03:23:30 -0700

This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new e597bd0dd33 [DOCS] Update Metadata table and metadata indexing related 
pages (#9406)
e597bd0dd33 is described below

commit e597bd0dd33fb1cd1f28ff0b28298d0d52e0030d
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Mon Aug 28 03:21:36 2023 -0700

    [DOCS] Update Metadata table and metadata indexing related pages (#9406)
    
    Summary:
    - Add inline configs
    - Add specific index descriptions
    - Add high level context
    - Restructure content for readability
---
 website/docs/metadata.md           | 211 ++++++++++++++++++++++++++-----------
 website/docs/metadata_indexing.md  |  21 ++--
 website/src/theme/DocPage/index.js |   2 +-
 3 files changed, 160 insertions(+), 74 deletions(-)

diff --git a/website/docs/metadata.md b/website/docs/metadata.md
index e0ffa5b3b21..d26d0bdd849 100644
--- a/website/docs/metadata.md
+++ b/website/docs/metadata.md
@@ -3,80 +3,173 @@ title: Metadata Table
 keywords: [ hudi, metadata, S3 file listings]
 ---
 
-## Motivation for a Metadata Table
+## Metadata Table
+
+Database indices contain auxiliary data structures to quickly locate records 
needed, without reading unnecessary data 
+from storage. Given that Hudi’s design has been heavily optimized for handling 
mutable change streams, with different 
+write patterns, Hudi considers [indexing](#indexing) as an integral part of 
its design and has uniquely supported 
+[indexing 
capabilities](https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/)
 from its inception, to speed 
+up upserts on the Data Lakehouse. While Hudi's indices have benefited writers 
for fast upserts and deletes, Hudi's metadata table 
+aims to tap these benefits more generally for both the readers and writers. 
The metadata table, implemented as a single 
+internal Hudi Merge-On-Read table, hosts different types of indices containing 
table metadata and is designed to be
+serverless and independent of compute and query engines. This is similar to 
common practices in databases where metadata
+is stored as internal views.
+
+The metadata table aims to significantly improve read/write performance of the 
queries by addressing the following key challenges:
+- **Eliminate the requirement of `list files` operation**:<br />
+  When reading and writing data, file listing operations are performed to get 
the current view of the file system.
+  When data sets are large, listing all the files may be a performance 
bottleneck, but more importantly in the case of cloud storage systems
+  like AWS S3, the large number of file listing requests sometimes causes 
throttling due to certain request limits.
+  The metadata table will instead proactively maintain the list of files and 
remove the need for recursive file listing operations.
+- **Expose column stats through indices for better query planning and faster 
lookups by readers**:<br />
+  Query engines rely on techniques such as partitioning and file pruning to 
cut down on the amount of irrelevant data 
+  scanned for query planning and execution. During the query planning phase, all 
data files are read for metadata on range 
+  information of columns for further pruning data files based on query 
predicates and available range information. This
+  approach is expensive and does not scale if there are a large number of 
partitions and data files to be scanned. In
+  addition to storage optimizations such as automatic file sizing, clustering, 
etc., that help organize data in a query-optimized
+  way, Hudi's metadata table improves query planning further by 
supporting multiple types of indices that aid 
+  in efficiently looking up data files based on relevant query predicates 
instead of reading the column stats from every 
+  individual data file and then pruning. 
+   
+## Supporting Multi-Modal Index in Hudi
+
+[Multi-modal 
indexing](https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi),
 
+introduced in [0.11.0 Hudi 
release](https://hudi.apache.org/releases/release-0.11.0/#multi-modal-index), 
+is a re-imagination of what a general purpose indexing subsystem should look 
like for the lake. Multi-modal indexing is 
+implemented by enhancing Hudi's metadata table with the flexibility to extend 
to new index types as new partitions,
+along with an [asynchronous 
index](https://hudi.apache.org/docs/metadata_indexing/#setup-async-indexing) 
building 
+mechanism and is built on the following core principles:
+- **Scalable metadata**: The table metadata, i.e., the auxiliary data about 
the table, must be scalable to extremely 
+  large size, e.g., Terabytes (TB).  Different types of indices should be 
easily integrated to support various use cases 
+  without having to worry about managing the same. To realize this, all 
indices in Hudi's metadata table are stored as 
+  partitions in a single internal MOR table. The MOR table layout enables 
lightning-fast writes by avoiding synchronous 
+  merge of data with reduced write amplification. This is extremely important 
for large datasets as the size of updates to the 
+  metadata table can grow to be unmanageable otherwise. This helps Hudi to 
scale metadata to TBs of sizes. The 
+  foundational framework for multi-modal indexing is built to enable and 
disable new indices as needed. The 
+  [async 
indexing](https://www.onehouse.ai/blog/asynchronous-indexing-using-hudi) 
supports index building alongside 
+  regular writers without impacting the write latency.
+- **ACID transactional updates**: The index and table metadata must always be 
up-to-date and in sync with the data table. 
+  This is achieved via multi-table transactions within Hudi, which ensure 
atomicity of writes and resiliency to failures so that 
+  partial writes to either the data or metadata table are never exposed to 
other read or write transactions. The metadata 
+  table is built to be self-managed so users don’t need to spend operational 
cycles on any table services, including 
+  compaction and cleaning.
+- **Fast lookup**: The needle-in-a-haystack type of lookups must be fast and 
efficient without having to scan the entire 
+  index, as index size can be TBs for large datasets. Since most accesses to the 
metadata table are point and range lookups,
+  the HFile format is chosen as the base file format for the internal metadata 
table. Since the metadata table stores 
+  the auxiliary data at the partition level (files index) or the file level 
(column_stats index), the lookup based on a 
+  single partition path and a file group is going to be very efficient with 
the HFile format. Both the base and log files 
+  in Hudi’s metadata table use the HFile format and are meticulously designed 
to reduce remote GET calls on cloud storage.
+  Further, these metadata table indices are served via a centralized timeline 
server which caches the metadata, further 
+  reducing the latency of the lookup from executors.
+
+### Metadata table indices
+
+Following are the different indices currently available under the metadata 
table.
+
+- ***[files 
index](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)***:
 
+  Stored as *files* partition in the metadata table. Contains file information 
such as file name, size, and active state
+  for each partition in the data table. Improves the files listing performance 
by avoiding direct file system calls such
+  as *exists, listStatus* and *listFiles* on the data table.
+
+- ***[column_stats 
index](https://github.com/apache/hudi/blob/master/rfc/rfc-27/rfc-27.md)***: 
Stored as *column_stats* 
+  partition in the metadata table. Contains statistics for the columns of interest, 
such as min and max values, total values, 
+  null counts, size, etc., for all data files. These statistics are used while serving 
queries with predicates matching those 
+  columns. This index is used along with [data 
skipping](https://www.onehouse.ai/blog/hudis-column-stats-index-and-data-skipping-feature-help-speed-up-queries-by-an-orders-of-magnitude)
 
+  to speed up queries by orders of magnitude. 
+
+- ***[bloom_filter 
index](https://github.com/apache/hudi/blob/46f41d186c6c84a6af2c54a907ff2736b6013e15/rfc/rfc-37/rfc-37.md)***:
 
+  Stored as *bloom_filter* partition in the metadata table. This index employs 
range-based pruning on the minimum and 
+  maximum values of the record keys and bloom-filter-based lookups to tag 
incoming records. For large tables, this 
+  involves reading the footers of all matching data files for bloom filters, 
which can be expensive in the case of random 
+  updates across the entire dataset. This index stores bloom filters of all 
data files centrally to avoid scanning the 
+  footers directly from all data files.
+
+- 
***[record_index](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets)***:
 
+  Stored as *record_index* partition in the metadata table. Contains the 
mapping of the record key to location. Record 
+  index is a global index, enforcing key uniqueness across all partitions in 
the table. Added in the 0.14.0 
+  Hudi release, this index aids in locating records faster than other existing 
indices and can provide speedups of orders of magnitude 
+  in large deployments where index lookup dominates write latencies.
+
+## Enable Hudi Metadata Table and Multi-Modal Index in write side
+
+Following are the Spark based basic configs that are needed to enable metadata 
and multi-modal indices. For advanced configs please refer 
+[here](https://hudi.apache.org/docs/next/configurations#Metadata-Configs-advanced-configs).
+
+| Config Name                               | Default                          
         | Description                                                          
                                                                                
                                                                                
                                                                                
    |
+|-------------------------------------------|-------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| hoodie.metadata.enable                    | true (Optional) Enabled on the 
write side | Enable the internal metadata table which serves table metadata 
like level file listings. For 0.10.1 and prior releases, metadata table is 
disabled by default and needs to be explicitly enabled.<br /><br />`Config 
Param: ENABLE`<br />`Since Version: 0.7.0`                                      
                    |
+| hoodie.metadata.index.bloom.filter.enable | false (Optional)                 
         | Enable indexing bloom filters of user data files under metadata 
table. When enabled, metadata table will have a partition to store the bloom 
filter index and will be used during the index lookups.<br /><br />`Config 
Param: ENABLE_METADATA_INDEX_BLOOM_FILTER`<br />`Since Version: 0.11.0`         
                 |
+| hoodie.metadata.index.column.stats.enable | false (Optional)                 
         | Enable indexing column ranges of user data files under metadata 
table key lookups. When enabled, metadata table will have a partition to store 
the column ranges and will be used for pruning files during the index 
lookups.<br /><br />`Config Param: ENABLE_METADATA_INDEX_COLUMN_STATS`<br 
/>`Since Version: 0.11.0` |
+| hoodie.metadata.record.index.enable       | false (Optional)                 
         | Create the HUDI Record Index within the Metadata Table<br /><br 
/>`Config Param: RECORD_INDEX_ENABLE_PROP`<br />`Since Version: 0.14.0`         
                                                                                
                                                                                
         |
+
+
+The metadata table with synchronous updates and metadata-table-based file 
listing are enabled by default.
+There are prerequisite configurations and steps in [Deployment 
considerations](#deployment-considerations-for-metadata-table) to
+safely use this feature.  The metadata table and related file listing 
functionality can still be turned off by setting
+[`hoodie.metadata.enable`](/docs/configurations#hoodiemetadataenable) to 
`false`. The 
+[multi-modal 
indices](https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi)
 are 
+disabled by default and can be enabled on the write side explicitly using the 
above configs.
 
-The Apache Hudi Metadata Table can significantly improve read/write 
performance of your queries. The main purpose of the
-Metadata Table is to eliminate the requirement for the "list files" operation.
+For Flink, the following are the basic configs of interest to enable metadata 
based indices. Please refer 
+[here](https://hudi.apache.org/docs/next/configurations#Flink-Options) for 
advanced configs.
 
-When reading and writing data, file listing operations are performed to get 
the current view of the file system.
-When data sets are large, listing all the files may be a performance 
bottleneck, but more importantly in the case of cloud storage systems
-like AWS S3, the large number of file listing requests sometimes causes 
throttling due to certain request limits.
-The Metadata Table will instead proactively maintain the list of files and 
remove the need for recursive file listing operations
+| Config Name                               | Default                          
         | Description                                                          
                                                                                
                                                                                
                                                                              |
+|-------------------------------------------|-------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| metadata.enabled                          | true (Optional)                  
         | Enable the internal metadata table which serves table metadata like 
level file listings, default enabled<br /><br /> `Config Param: 
METADATA_ENABLED`                                                               
                                                                                
               |
+| hoodie.metadata.index.column.stats.enable | false (Optional)                 
         | Enable indexing column ranges of user data files under metadata 
table key lookups. When enabled, metadata table will have a partition to store 
the column ranges and will be used for pruning files during the index 
lookups.<br /> |
 
-### Some numbers from a study:
-Running a TPCDS benchmark the p50 list latencies for a single folder scales 
~linearly with the amount of files/objects:
 
-|Number of files/objects|100|1K|10K|100K|
-|---|---|---|---|---|
-|P50 list latency|50ms|131ms|1062ms|9932ms|
+:::note
+If you turn off the metadata table after enabling, be sure to wait for a few 
commits so that the metadata table is fully
+cleaned up, before re-enabling the metadata table again.
+:::
 
-Whereas listings from the Metadata Table will not scale linearly with 
file/object count and instead take about 100-500ms per read even for very large 
tables.
-Even better, the timeline server caches portions of the metadata (currently 
only for writers), and provides ~10ms performance for listings.
+## Use metadata indices for query side improvements
 
-### Supporting Multi-Modal Index
+### files index
+Metadata based listing using *files_index* can be leveraged on the read side 
by setting appropriate configs/session properties
+from different engines as shown below:
 
-Multi-modal index can drastically improve the lookup performance in file index 
and query latency with data skipping.
-Bloom filter index containing the file-level bloom filter facilitates the key 
lookup and file pruning.  The column stats
-index containing the statistics of all columns improves file pruning based on 
key and column value range in both the
-writer and the reader, in query planning in Spark for example.  Multi-modal 
index is implemented as independent partitions
-containing the indices in the metadata table.
+| Readers                                                                      
    | Config                 | Description                                      
                                                                             |
+|----------------------------------------------------------------------------------|------------------------|-------------------------------------------------------------------------------------------------------------------------------|
+| <ul><li>Spark DataSource</li><li>Spark SQL</li><li>Structured 
Streaming</li></ul> | hoodie.metadata.enable | When set to `true` enables use 
of the spark file index implementation for Hudi, that speeds up listing of 
large tables.<br /> |
+|Presto| 
[hudi.metadata-table-enabled](https://prestodb.io/docs/current/connector/hudi.html)
             | When set to `true` fetches the list of file names and sizes from 
Hudi’s metadata table rather than storage.                   |
+|Trino| 
[hudi.metadata-enabled](https://trino.io/docs/current/connector/hudi.html#general-configuration)
 | When set to `true` fetches the list of file names and sizes from metadata 
rather than storage.                                |
+|Athena| 
[hudi.metadata-listing-enabled](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html)
 | When this table property is set to `TRUE` enables the Hudi metadata table 
and the related file listing functionality          |
+|<ul><li>Flink DataStream</li><li>Flink SQL</li></ul> | metadata.enabled | 
When set to `true` from DDL uses the internal metadata table to serve table 
metadata like level file listings                |
 
-## Enable Hudi Metadata Table and Multi-Modal Index
-Since 0.11.0, the metadata table with synchronous updates and 
metadata-table-based file listing are enabled by default.
-There are prerequisite configurations and steps in [Deployment 
considerations](#deployment-considerations) to
-safely use this feature.  The metadata table and related file listing 
functionality can still be turned off by setting
-[`hoodie.metadata.enable`](/docs/configurations#hoodiemetadataenable) to 
`false`.  For 0.10.1 and prior releases, metadata
-table is disabled by default, and you can turn it on by setting the same 
config to `true`.
+### column_stats index and data skipping
+Enabling metadata table and column stats index is a prerequisite to enabling 
data skipping capabilities. Following are the 
+corresponding configs across Spark and Flink readers.
 
-If you turn off the metadata table after enabling, be sure to wait for a few 
commits so that the metadata table is fully
-cleaned up, before re-enabling the metadata table again.
+| Readers                                                                      
                      | Config                                                  
                         | Description                                          
                                                                                
                                                                                
                                                                                
              [...]
+|----------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 [...]
+| <ul><li>Spark DataSource</li><li>Spark SQL</li><li>Structured 
Streaming</li></ul>                   | 
<ul><li>`hoodie.metadata.enable`</li><li>`hoodie.enable.data.skipping`</li></ul>
 | <ul><li>When set to `true` enables use of the spark file index 
implementation for Hudi, that speeds up listing of large tables.</li><li>When 
set to `true` enables data-skipping allowing queries to leverage indices to 
reduce the search space by skipping over files <br />`Config Param: 
ENABLE_DATA_SKIPPING` [...]
+|<ul><li>Flink DataStream</li><li>Flink SQL</li></ul> 
|<ul><li>`metadata.enabled`</li><li>`read.data.skipping.enabled`</li></ul> | 
<ul><li> When set to `true` from DDL uses the internal metadata table to serve 
table metadata like level file listings</li><li>When set to `true` enables 
data-skipping allowing queries to leverage indices to reduce the search space 
by skipping over files</li></ul>                                                
                                        |
 
-The [multi-modal 
index](https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi)
 is
-introduced in 0.11.0 release. They are disabled by default. You can choose to 
enable bloom filter index by
-setting `hoodie.metadata.index.bloom.filter.enable` to `true` and enable 
column stats index by setting
-`hoodie.metadata.index.column.stats.enable` to `true`, when metadata table is 
enabled. In 0.11.0 release, data skipping
-to improve queries in Spark now relies on the column stats index in metadata 
table. The enabling of metadata table and
-column stats index is prerequisite to enabling data skipping with 
`hoodie.enable.data.skipping`. Moreover, the metadata
-indexes can be built asynchronously without blocking regular ingestion writers.
-Checkout [asynchronous metadata indexing](/docs/metadata_indexing) docs for 
more details.
-
-## Deployment considerations
-To ensure that Metadata Table stays up to date, all write operations on the 
same Hudi table need additional configurations
+
+## Deployment considerations for metadata Table
+To ensure that metadata table stays up to date, all write operations on the 
same Hudi table need additional configurations
 besides the above in different deployment models.  Before enabling metadata 
table, all writers on the same table must
-be stopped.
+be stopped. Please refer to the different [deployment 
models](/docs/concurrency_control#deployment-models-with-supported-concurrency-controls)
 
+for more details on each model. This section only highlights how to safely 
enable metadata table in different deployment models. 
 
 ### Deployment Model A: Single writer with inline table services
 
-If your current deployment model is single writer and all table services 
(cleaning, clustering, compaction) are configured
-to be inline, such as Deltastreamer sync-once mode and Spark Datasource with 
default configs, there is no additional configuration
-required.  After setting 
[`hoodie.metadata.enable`](/docs/configurations#hoodiemetadataenable) to 
`true`, restarting
+In [Model 
A](/docs/concurrency_control#model-a-single-writer-with-inline-table-services), 
after setting 
[`hoodie.metadata.enable`](/docs/configurations#hoodiemetadataenable) to 
`true`, restarting
 the single writer is sufficient to safely enable metadata table.
 
 ### Deployment Model B: Single writer with async table services
 
-If your current deployment model is single writer along with async table 
services (such as cleaning, clustering, compaction)
-running in the same process, such as Deltastreamer continuous mode writing MOR 
table, Spark streaming (where compaction is async by default),
-and your own job setup enabling async table services inside the same writer, 
it is a must to have the optimistic concurrency control,
-the lock provider, and lazy failed write clean policy configured before 
enabling metadata table as follows.  This is to guarantee
-the proper behavior of [optimistic concurrency 
control](/docs/concurrency_control#enabling-multi-writing) when enabling
-metadata table. Failing to follow the configuration guide leads to loss of 
data.  Note that these configurations are
-required only if metadata table is enabled in this deployment model.
-
+If your current deployment model is [Model 
B](/docs/concurrency_control#model-b-single-writer-with-async-table-services), 
enabling metadata
+table requires adding optimistic concurrency control along with the suggested lock 
provider, as shown below.
 ```properties
 hoodie.write.concurrency.mode=optimistic_concurrency_control
-hoodie.cleaner.policy.failed.writes=LAZY
 
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
 ```
+:::note
+These configurations are required only if metadata table is enabled in this 
deployment model.
+:::
 
 If multiple writers in different processes are present, including one writer 
with async table services, please refer to
 [Deployment Model C: Multi-writer](#deployment-model-c-multi-writer) for 
configs, with the difference of using a
@@ -86,16 +179,16 @@ process which cannot rely on the in-process lock provider.
 
 ### Deployment Model C: Multi-writer
 
-If your current deployment model is multi-writer along with a lock provider 
and other required configs set for every writer
-as follows, there is no additional configuration required.  You can bring up 
the writers sequentially after stopping the
-writers for enabling metadata table.  Applying the proper configurations to 
only partial writers leads to loss of data
-from the inconsistent writer. So, ensure you enable metadata table across all 
writers.
+If your current deployment model is 
[multi-writer](/docs/concurrency_control#model-c-multi-writer) along with a 
lock 
+provider and other required configs set for every writer as follows, there is 
no additional configuration required. You 
+can bring up the writers sequentially after stopping the writers for enabling 
metadata table. Applying the proper 
+configurations to only partial writers leads to loss of data from the 
inconsistent writer. So, ensure you enable 
+metadata table across all writers.
 
 ```properties
 hoodie.write.concurrency.mode=optimistic_concurrency_control
-hoodie.cleaner.policy.failed.writes=LAZY
 hoodie.write.lock.provider=<distributed-lock-provider-classname>
 ```
 
-Note that there are 4 different [lock providers 
available](/docs/concurrency_control#enabling-multi-writing)
-to choose from: `FileSystemBasedLockProvider`, `ZookeeperBasedLockProvider`, 
`HiveMetastoreBasedLockProvider`, and `DynamoDBBasedLockProvider`.
\ No newline at end of file
+Note that there are different external [lock providers 
available](/docs/concurrency_control#external-locking-and-lock-providers)
+to choose from.
\ No newline at end of file
diff --git a/website/docs/metadata_indexing.md 
b/website/docs/metadata_indexing.md
index 85a10dcec8d..b9874b9cb9c 100644
--- a/website/docs/metadata_indexing.md
+++ b/website/docs/metadata_indexing.md
@@ -9,11 +9,11 @@ Hudi maintains a scalable [metadata](/docs/metadata) that has 
some auxiliary dat
 The [pluggable indexing 
subsystem](https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi)
 of Hudi depends on the metadata table. Different types of index, from `files` 
index for locating records efficiently
 to `column_stats` index for data skipping, are part of the metadata table. A 
fundamental tradeoff in any data system
-that supports indexes is to balance the write throughput with index updates. A 
brute-force way is to lock out the writes
+that supports indices is to balance the write throughput with index updates. A 
brute-force way is to lock out the writes
 while indexing. However, very large tables can take hours to index. This is 
where Hudi's novel asynchronous metadata
 indexing comes into play.
 
-We can now create different metadata indexes, including `files`, 
`bloom_filters` and `column_stats`, asynchronously in
+We can now create different metadata indices, including `files`, 
`bloom_filters`, `column_stats` and `record_index` asynchronously in
 Hudi, which are then used by readers and writers to improve performance. Being 
able to index without blocking writing
 has two benefits,
 
@@ -71,22 +71,15 @@ spark-submit \
 </p>
 </details>
 
-From version 0.11.0 onwards, Hudi metadata table is enabled by default and the 
files index will be automatically created. While the Hudi Streamer is running 
in continuous mode, let
+Hudi metadata table is enabled by default and the files index will be 
automatically created. While the Hudi Streamer is running in continuous mode, 
let
 us schedule the indexing for COLUMN_STATS index. First we need to define a 
properties file for the indexer.
 
 ### Configurations
 
-As mentioned before, metadata indexes are pluggable. One can add any index at 
any point in time depending on changing
-business requirements. Some configurations to enable particular indexes are 
listed below. The full set of metadata
-configurations can be explored [here](/docs/configurations/#Metadata-Configs).
-
-
-|Config| Default | Scope | Description | Since Version |
-|---|---|---|---|---|
-| hoodie.metadata.enable | true | Metadata table | Set to false to disable 
metadata table | 0.7.0 |
-| hoodie.metadata.index.async | false | Metadata table | Enable async indexing 
of metadata table. | 0.11.0 |
-| hoodie.metadata.index.column.stats.enable | false | Metadata table | Enable 
indexing column ranges of user data files under metadata table key lookups | 
0.11.0 |
-| hoodie.metadata.index.bloom.filter.enable | false | Metadata table | Enable 
indexing bloom filters of user data files under metadata table | 0.11.0 |
+As mentioned before, metadata indices are pluggable. One can add any index at 
any point in time depending on changing
+business requirements. Some configurations to enable particular indices are 
listed below. The indices currently available under the
+metadata table can be explored 
[here](/docs/next/metadata#metadata-table-indices) along with 
[configs](/docs/next/metadata#enable-hudi-metadata-table-and-multi-modal-index-in-write-side)
 
+to enable them. The full set of metadata configurations can be explored 
[here](/docs/next/configurations/#Metadata-Configs).
 
 :::note
 Enabling the metadata table and configuring a lock provider are the 
prerequisites for using async indexer. Checkout a sample
diff --git a/website/src/theme/DocPage/index.js 
b/website/src/theme/DocPage/index.js
index 6166cd67181..640f37b3915 100644
--- a/website/src/theme/DocPage/index.js
+++ b/website/src/theme/DocPage/index.js
@@ -128,7 +128,7 @@ function DocPageContent({
   );
 }
 
-const arrayOfPages = (matchPath) => [`${matchPath}/configurations`, 
`${matchPath}/basic_configurations`, `${matchPath}/timeline`, 
`${matchPath}/table_types`, `${matchPath}/migration_guide`, 
`${matchPath}/compaction`, `${matchPath}/clustering`, `${matchPath}/indexing`];
+const arrayOfPages = (matchPath) => [`${matchPath}/configurations`, 
`${matchPath}/basic_configurations`, `${matchPath}/timeline`, 
`${matchPath}/table_types`, `${matchPath}/migration_guide`, 
`${matchPath}/compaction`, `${matchPath}/clustering`, `${matchPath}/indexing`, 
`${matchPath}/metadata`, `${matchPath}/metadata_indexing`];
 const showCustomStylesForDocs = (matchPath, pathname) => 
arrayOfPages(matchPath).includes(pathname);
 function DocPage(props) {
   const {
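
The write-side settings documented in the metadata.md hunk above can be assembled programmatically. The following is a minimal sketch for illustration only and is not part of the commit: the helper name `metadata_write_options` and its parameters are hypothetical, while the config keys and the Model B lock settings are copied from the diff. The resulting dictionary is intended to be passed to a Spark writer (for example via `df.write.format("hudi").options(**opts)`), which is not exercised here.

```python
from typing import Dict


def metadata_write_options(
    column_stats: bool = False,
    bloom_filter: bool = False,
    record_index: bool = False,
    async_table_services: bool = False,
) -> Dict[str, str]:
    """Build write-side Hudi options for the metadata table (hypothetical helper)."""

    def flag(value: bool) -> str:
        # Hudi options are passed as strings.
        return "true" if value else "false"

    opts = {
        # The metadata table is enabled by default; set explicitly for clarity.
        "hoodie.metadata.enable": "true",
        # The multi-modal indices are disabled by default.
        "hoodie.metadata.index.column.stats.enable": flag(column_stats),
        "hoodie.metadata.index.bloom.filter.enable": flag(bloom_filter),
        "hoodie.metadata.record.index.enable": flag(record_index),
    }

    if async_table_services:
        # Deployment Model B: single writer with async table services in the
        # same process needs optimistic concurrency control and a lock provider.
        opts["hoodie.write.concurrency.mode"] = "optimistic_concurrency_control"
        opts["hoodie.write.lock.provider"] = (
            "org.apache.hudi.client.transaction.lock.InProcessLockProvider"
        )
    return opts
```

For a multi-writer setup (Model C), the in-process lock provider would have to be swapped for a distributed one, which this sketch does not model.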

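The updated docs state that the metadata table and the column stats index are prerequisites for data skipping on the read side. The sketch below encodes that rule for Spark readers; it is an illustration under stated assumptions, not Hudi code. The function names `data_skipping_ready` and `spark_read_options` are hypothetical; the config keys and their defaults (`hoodie.metadata.enable` defaulting to true, `hoodie.metadata.index.column.stats.enable` defaulting to false) are taken from the tables in the diff.

```python
from typing import Dict, Mapping


def _is_true(opts: Mapping[str, str], key: str, default: str) -> bool:
    # Interpret a string-valued option, falling back to the documented default.
    return opts.get(key, default).strip().lower() == "true"


def data_skipping_ready(write_opts: Mapping[str, str]) -> bool:
    """Return True when the table was written with both prerequisites enabled."""
    metadata_on = _is_true(write_opts, "hoodie.metadata.enable", "true")
    col_stats_on = _is_true(
        write_opts, "hoodie.metadata.index.column.stats.enable", "false"
    )
    return metadata_on and col_stats_on


def spark_read_options(write_opts: Mapping[str, str]) -> Dict[str, str]:
    """Derive Spark read options; data skipping is requested only when usable."""
    metadata_on = _is_true(write_opts, "hoodie.metadata.enable", "true")
    return {
        "hoodie.metadata.enable": "true" if metadata_on else "false",
        "hoodie.enable.data.skipping": (
            "true" if data_skipping_ready(write_opts) else "false"
        ),
    }
```

For Flink readers the corresponding keys in the diff are `metadata.enabled` and `read.data.skipping.enabled`; the same prerequisite check would apply.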