This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 2ad9817d66 [HUDI-3931][DOCS] Guide to setup async metadata indexing (#5476)
2ad9817d66 is described below

commit 2ad9817d66cbe712aac55f56d018671fa024fa72
Author: Sagar Sumit <[email protected]>
AuthorDate: Sat Apr 30 18:44:34 2022 +0530

    [HUDI-3931][DOCS] Guide to setup async metadata indexing (#5476)
    
    
    * Point to async indexing guide in release note
---
 website/docs/async_meta_indexing.md | 188 ++++++++++++++++++++++++++++++++++++
 website/releases/release-0.11.0.md  |  31 +-----
 website/sidebars.js                 |   3 +-
 3 files changed, 191 insertions(+), 31 deletions(-)

diff --git a/website/docs/async_meta_indexing.md b/website/docs/async_meta_indexing.md
new file mode 100644
index 0000000000..a76276d3fb
--- /dev/null
+++ b/website/docs/async_meta_indexing.md
@@ -0,0 +1,188 @@
+---
+title: Async Metadata Indexing
+summary: "In this page, we describe how to run metadata indexing asynchronously."
+toc: true
+last_modified_at:
+---
+
+We can now create different metadata indexes, including files, bloom filters and column stats,
+asynchronously in Hudi. Being able to index without blocking ingestion has two benefits:
+improved ingestion latency (and hence an even smaller gap between event time and arrival time),
+and fewer points of failure on the ingestion path. To learn more about the design of this
+feature, please check out [RFC-45](https://github.com/apache/hudi/blob/master/rfc/rfc-45/rfc-45.md).
+
+## Setup Async Indexing
+
+First, we will generate a continuous workload. In the below example, we are going to start a [deltastreamer](/docs/hoodie_deltastreamer#deltastreamer) which will continuously write data
+from raw parquet files to a Hudi table. We use the widely available [NY Taxi dataset](https://registry.opendata.aws/nyc-tlc-trip-records-pds/), whose setup details are as below:
+<details>
+  <summary>Ingestion write config</summary>
+<p>
+
+```
+hoodie.datasource.write.recordkey.field=VendorID
+hoodie.datasource.write.partitionpath.field=tpep_dropoff_datetime
+hoodie.datasource.write.precombine.field=tpep_dropoff_datetime
+hoodie.deltastreamer.source.dfs.root=/Users/home/path/to/data/parquet_files/
+hoodie.deltastreamer.schemaprovider.target.schema.file=/Users/home/path/to/schema/schema.avsc
+hoodie.deltastreamer.schemaprovider.source.schema.file=/Users/home/path/to/schema/schema.avsc
+// set lock provider configs
+hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
+hoodie.write.lock.zookeeper.url=<zk_url>
+hoodie.write.lock.zookeeper.port=<zk_port>
+hoodie.write.lock.zookeeper.lock_key=<zk_key>
+hoodie.write.lock.zookeeper.base_path=<zk_base_path>
+```
+
+</p>
+</details>
+
+<details>
+  <summary>Run deltastreamer</summary>
+<p>
+
+```
+spark-submit \
+--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls /Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar` \
+--props `ls /Users/home/path/to/write/config.properties` \
+--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
+--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
+--source-ordering-field tpep_dropoff_datetime   \
+--table-type COPY_ON_WRITE \
+--target-base-path file:///tmp/hudi-ny-taxi/   \
+--target-table ny_hudi_tbl  \
+--op UPSERT  \
+--continuous \
+--source-limit 5000000 \
+--min-sync-interval-seconds 60
+```
+
+</p>
+</details>
+
+From version 0.11.0 onwards, the Hudi metadata table is enabled by default and the files index is created automatically. While the deltastreamer is running in continuous mode, let
+us schedule indexing for the COLUMN_STATS index. First, we need to define a properties file for the indexer.
+
+Note: Enabling the metadata table and configuring a lock provider are the prerequisites for using the async indexer.
+```
+# ensure that both metadata and async indexing are enabled via the below two configs
+hoodie.metadata.enable=true
+hoodie.metadata.index.async=true
+# enable column_stats index config
+hoodie.metadata.index.column.stats.enable=true
+# set concurrency mode and lock configs as this is a multi-writer scenario
+# check https://hudi.apache.org/docs/concurrency_control/ for different lock provider configs
+hoodie.write.concurrency.mode=optimistic_concurrency_control
+hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
+hoodie.write.lock.zookeeper.url=<zk_url>
+hoodie.write.lock.zookeeper.port=<zk_port>
+hoodie.write.lock.zookeeper.lock_key=<zk_key>
+hoodie.write.lock.zookeeper.base_path=<zk_base_path>
+```
+
+### Schedule Indexing
+
+Now, we can schedule indexing using `HoodieIndexer` in `schedule` mode as follows:
+
+```
+spark-submit \
+--class org.apache.hudi.utilities.HoodieIndexer \
+/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar \
+--props /Users/home/path/to/indexer.properties \
+--mode schedule \
+--base-path /tmp/hudi-ny-taxi \
+--table-name ny_hudi_tbl \
+--index-types COLUMN_STATS \
+--parallelism 1 \
+--spark-memory 1g
+```
+
+This will write an `indexing.requested` instant to the timeline.
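+
+As a quick sanity check (a sketch reusing the base path from the examples above), the scheduled instant can be spotted on the timeline:
+
+```
+ls /tmp/hudi-ny-taxi/.hoodie | grep indexing
+# e.g. 20220414195437837.indexing.requested
+```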
+
+### Execute Indexing
+
+To execute indexing, run the indexer in `execute` mode as below.
+
+```
+spark-submit \
+--class org.apache.hudi.utilities.HoodieIndexer \
+/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar \
+--props /Users/home/path/to/indexer.properties \
+--mode execute \
+--base-path /tmp/hudi-ny-taxi \
+--table-name ny_hudi_tbl \
+--index-types COLUMN_STATS \
+--parallelism 1 \
+--spark-memory 1g
+```
+
+We can also run the indexer in `scheduleAndExecute` mode to do the above two steps in one shot. Doing it separately gives us better control over when we want to execute.
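+
+As a sketch, the one-shot invocation only changes the `--mode` flag of the command above (paths follow the earlier examples in this guide):
+
+```
+spark-submit \
+--class org.apache.hudi.utilities.HoodieIndexer \
+/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar \
+--props /Users/home/path/to/indexer.properties \
+--mode scheduleAndExecute \
+--base-path /tmp/hudi-ny-taxi \
+--table-name ny_hudi_tbl \
+--index-types COLUMN_STATS \
+--parallelism 1 \
+--spark-memory 1g
+```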
+
+Let's look at the data timeline.
+
+```
+ls -lrt /tmp/hudi-ny-taxi/.hoodie
+total 1816
+-rw-r--r--  1 sagars  wheel       0 Apr 14 19:53 20220414195327683.commit.requested
+-rw-r--r--  1 sagars  wheel  153423 Apr 14 19:54 20220414195327683.inflight
+-rw-r--r--  1 sagars  wheel  207061 Apr 14 19:54 20220414195327683.commit
+-rw-r--r--  1 sagars  wheel       0 Apr 14 19:54 20220414195423420.commit.requested
+-rw-r--r--  1 sagars  wheel     659 Apr 14 19:54 20220414195437837.indexing.requested
+-rw-r--r--  1 sagars  wheel  323950 Apr 14 19:54 20220414195423420.inflight
+-rw-r--r--  1 sagars  wheel       0 Apr 14 19:55 20220414195437837.indexing.inflight
+-rw-r--r--  1 sagars  wheel  222920 Apr 14 19:55 20220414195423420.commit
+-rw-r--r--  1 sagars  wheel     734 Apr 14 19:55 hoodie.properties
+-rw-r--r--  1 sagars  wheel     979 Apr 14 19:55 20220414195437837.indexing
+```
+
+In the data timeline, we can see that indexing was scheduled after one commit completed (`20220414195327683.commit`) and another was requested
+(`20220414195423420.commit.requested`). This would have picked `20220414195327683` as the base instant. Indexing was inflight concurrently with an inflight
+writer as well. If we parse the indexer logs, we would find that it indeed caught up with instant `20220414195423420` after indexing up to the base instant.
+
+```
+22/04/14 19:55:22 INFO HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=HFILE) from /tmp/hudi-ny-taxi/.hoodie/metadata
+22/04/14 19:55:22 INFO RunIndexActionExecutor: Starting Index Building with base instant: 20220414195327683
+22/04/14 19:55:22 INFO HoodieBackedTableMetadataWriter: Creating a new metadata index for partition 'column_stats' under path /tmp/hudi-ny-taxi/.hoodie/metadata upto instant 20220414195327683
+...
+...
+22/04/14 19:55:38 INFO RunIndexActionExecutor: Total remaining instants to index: 1
+22/04/14 19:55:38 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from /tmp/hudi-ny-taxi/.hoodie/metadata
+22/04/14 19:55:38 INFO HoodieTableConfig: Loading table properties from /tmp/hudi-ny-taxi/.hoodie/metadata/.hoodie/hoodie.properties
+22/04/14 19:55:38 INFO HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=HFILE) from /tmp/hudi-ny-taxi/.hoodie/metadata
+22/04/14 19:55:38 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20220414195423420__deltacommit__COMPLETED]}
+22/04/14 19:55:38 INFO RunIndexActionExecutor: Starting index catchup task
+...
+```
+
+### Drop Index
+
+To drop an index, just run the indexer in `dropindex` mode.
+
+```
+spark-submit \
+--class org.apache.hudi.utilities.HoodieIndexer \
+/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar \
+--props /Users/home/path/to/indexer.properties \
+--mode dropindex \
+--base-path /tmp/hudi-ny-taxi \
+--table-name ny_hudi_tbl \
+--index-types COLUMN_STATS \
+--parallelism 1 \
+--spark-memory 2g
+```
+
+## Caveats
+
+The asynchronous indexing feature is still evolving. A few points to note from a deployment perspective while running the indexer:
+
+- While an index can be created concurrently with ingestion, it cannot be dropped concurrently. Please stop all writers
+  before dropping an index.
+- The files index is created by default as long as the metadata table is enabled.
+- Trigger indexing for one metadata partition (or index type) at a time.
+- If an index is enabled via the async HoodieIndexer, then ensure that the same index is also enabled in the configs of the regular ingestion writers. Otherwise, the metadata writer will
+  think that the particular index was disabled and clean up the metadata partition.
+- In the case of multiple writers, enable the async index and the specific index configs for all writers.
+- Unlike other table services such as compaction and clustering, which have a separate configuration to run inline, there is no such inline config here.
+  For example, if async indexing is disabled and the metadata table is enabled along with the column stats index type, then both the files and column stats indexes will be created synchronously with ingestion.
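+
+As a sketch, the synchronous setup described in the last point above would amount to the following writer configs (no separate indexer run involved; config keys as used earlier in this guide):
+
+```
+# metadata table and column stats enabled, async indexing NOT enabled,
+# so both files and column_stats indexes are built inline with ingestion
+hoodie.metadata.enable=true
+hoodie.metadata.index.column.stats.enable=true
+hoodie.metadata.index.async=false
+```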
+
+Some of these limitations will be overcome in the upcoming releases. Please
+follow [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488) for developments on this feature.
diff --git a/website/releases/release-0.11.0.md b/website/releases/release-0.11.0.md
index a56b16f663..3eebcf66e9 100644
--- a/website/releases/release-0.11.0.md
+++ b/website/releases/release-0.11.0.md
@@ -56,7 +56,7 @@ ingestion. The indexer adds a new action `indexing` on the timeline. While the i
 and non-blocking to writers, a lock provider needs to be configured to safely co-ordinate the process with the inflight
 writers.
 
-*See the [migration guide](#migration-guide) for more details.*
+*See the [async indexing guide](/docs/next/async_meta_indexing) for more details.*
 
 ### Spark DataSource Improvements
 
@@ -182,35 +182,6 @@ tables. This is useful when tailing Hive tables in `HoodieDeltaStreamer` instead
 
 ## Migration Guide
 
-### Use async indexer
-
-Enabling metadata table and configuring a lock provider are the prerequisites for using async indexer. The
-implementation details were illustrated in [RFC-45](https://github.com/apache/hudi/blob/master/rfc/rfc-45/rfc-45.md). At
-the minimum, users need to set the following configurations to schedule and run the indexer:
-
-```shell
-# enable async index
-hoodie.metadata.index.async=true
-# enable specific index type, column stats for example 
-hoodie.metadata.index.column.stats.enable=true
-# set OCC concurrency mode
-hoodie.write.concurrency.mode=optimistic_concurrency_control
-# set lock provider configs
-hoodie.write.lock.provider=<LockProviderClass>
-```
-
-Few points to note from deployment perspective:
-
-1. Files index is created by default as long as the metadata table is enabled.
-2. If you intend to build any index asynchronously, say column stats, then be sure to enable the async index and column
-   stats index type on the regular ingestion writers as well.
-3. In the case of multi-writers, enable async index and specific index config for all writers.
-4. While an index can be created concurrently with ingestion, it cannot be dropped concurrently. Please stop all writers
-   before dropping an index.
-
-Some of these limitations will be overcome in the upcoming releases. Please
-follow [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488) for developments on this feature.
-
 ### Bundle usage
 
 As we relax the requirement of adding `spark-avro` package in 0.11.0 to work with Spark and Utilities bundle,
diff --git a/website/sidebars.js b/website/sidebars.js
index c970c1ac5c..90acdf2d3f 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -74,7 +74,8 @@ module.exports = {
                 'file_sizing',
                 'disaster_recovery',
                 'snapshot_exporter',
-                'precommit_validator'
+                'precommit_validator',
+                'async_meta_indexing'
             ],
         },
         'configurations',
