This is an automated email from the ASF dual-hosted git repository.
codope pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new a8759beb8ee [DOCS] Update metadata indexing doc (#7967)
a8759beb8ee is described below
commit a8759beb8ee8ca28f76430e6c2656f220b20a4e7
Author: Sagar Sumit <[email protected]>
AuthorDate: Mon Feb 27 13:35:09 2023 +0530
[DOCS] Update metadata indexing doc (#7967)
---
website/docs/metadata.md | 13 ++++++++-----
website/docs/metadata_indexing.md | 35 ++++++++++++++++++++++++++++++-----
2 files changed, 38 insertions(+), 10 deletions(-)
diff --git a/website/docs/metadata.md b/website/docs/metadata.md
index c10d3e0d9b1..e0ffa5b3b21 100644
--- a/website/docs/metadata.md
+++ b/website/docs/metadata.md
@@ -41,11 +41,14 @@ table is disabled by default, and you can turn it on by
setting the same config
If you turn off the metadata table after enabling, be sure to wait for a few
commits so that the metadata table is fully
cleaned up, before re-enabling the metadata table again.
-The multi-modal index is introduced in 0.11.0 release. They are disabled by
default. You can choose to enable bloom
-filter index by setting `hoodie.metadata.index.bloom.filter.enable` to `true`
and enable column stats index by setting
-`hoodie.metadata.index.column.stats.enable` to `true`, when metadata table is
enabled. In 0.11.0 release, data skipping
-to improve queries in Spark now relies on the column stats index in metadata
table. The enabling of metadata table and
-column stats index is prerequisite to enabling data skipping with
`hoodie.enable.data.skipping`.
+The [multi-modal
index](https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi)
was
+introduced in the 0.11.0 release. Its indexes are disabled by default. You can
enable the bloom filter index by
+setting `hoodie.metadata.index.bloom.filter.enable` to `true` and the
column stats index by setting
+`hoodie.metadata.index.column.stats.enable` to `true`, when the metadata table is
enabled. As of the 0.11.0 release, data skipping
+to improve queries in Spark relies on the column stats index in the metadata
table. Enabling the metadata table and
+the column stats index is a prerequisite for enabling data skipping with
`hoodie.enable.data.skipping`. Moreover, the metadata
+indexes can be built asynchronously without blocking regular ingestion writers.
+Check out the [asynchronous metadata indexing](/docs/metadata_indexing) docs for
more details.
## Deployment considerations
To ensure that Metadata Table stays up to date, all write operations on the
same Hudi table need additional configurations
diff --git a/website/docs/metadata_indexing.md
b/website/docs/metadata_indexing.md
index 18dacc74ce4..3a785c41d6c 100644
--- a/website/docs/metadata_indexing.md
+++ b/website/docs/metadata_indexing.md
@@ -5,13 +5,23 @@ toc: true
last_modified_at:
---
-We can now create different metadata indexes, including files, bloom filters
and column stats,
-asynchronously in Hudi, which are then used by queries and writing to improve
performance.
-Being able to index without blocking writing has two benefits,
+Hudi maintains a scalable [metadata table](/docs/metadata) that holds auxiliary
data about the table.
+The [pluggable indexing
subsystem](https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi)
+of Hudi depends on the metadata table. Different index types, from the `files`
index for locating records efficiently
+to the `column_stats` index for data skipping, are part of the metadata table. A
fundamental tradeoff in any data system
+that supports indexes is balancing write throughput against index updates. A
brute-force approach is to lock out writes
+while indexing. However, very large tables can take hours to index. This is
where Hudi's novel asynchronous metadata
+indexing comes into play.
+
+We can now create different metadata indexes, including `files`,
`bloom_filters` and `column_stats`, asynchronously in
+Hudi, which are then used by readers and writers to improve performance. Being
able to index without blocking writing
+has two benefits:
+
- improved write latency
- reduced resource wastage due to contention between writing and indexing.
-To learn more about the design of this feature, please check out
[RFC-45](https://github.com/apache/hudi/blob/master/rfc/rfc-45/rfc-45.md).
+In this document, we will learn how to set up asynchronous metadata indexing.
To learn more about the design of this
+feature, please check out [this
blog](https://www.onehouse.ai/blog/asynchronous-indexing-using-hudi).
## Setup Async Indexing
@@ -64,8 +74,23 @@ spark-submit \
From version 0.11.0 onwards, Hudi metadata table is enabled by default and the
files index will be automatically created. While the deltastreamer is running
in continuous mode, let
us schedule the indexing for COLUMN_STATS index. First we need to define a
properties file for the indexer.
+### Configurations
+
+As mentioned before, metadata indexes are pluggable. One can add any index at
any point in time depending on changing
+business requirements. Some configurations to enable particular indexes are
listed below. The full set of metadata
+configurations can be explored [here](/docs/configurations/#Metadata-Configs).
+
+
+| Config | Default | Scope | Description | Since Version |
+|---|---|---|---|---|
+| hoodie.metadata.enable | true | Metadata table | Set to false to disable metadata table | 0.7.0 |
+| hoodie.metadata.index.async | false | Metadata table | Enable async indexing of metadata table. | 0.11.0 |
+| hoodie.metadata.index.column.stats.enable | false | Metadata table | Enable indexing column ranges of user data files under metadata table key lookups | 0.11.0 |
+| hoodie.metadata.index.bloom.filter.enable | false | Metadata table | Enable indexing bloom filters of user data files under metadata table | 0.11.0 |
+
:::note
-Enabling metadata table and configuring a lock provider are the prerequisites
for using async indexer.
+Enabling the metadata table and configuring a lock provider are the
prerequisites for using the async indexer. Check out the sample
+configuration below.
:::
```