(hudi) branch asf-site updated: [DOCS] Add indexing examples and info for 1.0 (#12409)

vinoth Tue, 03 Dec 2024 07:52:53 -0800

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 40c48f25e7a [DOCS] Add indexing examples and info for 1.0 (#12409)
40c48f25e7a is described below

commit 40c48f25e7ae34b67c558cac6a83077bd4564254
Author: Lokesh Jain <[email protected]>
AuthorDate: Tue Dec 3 21:22:43 2024 +0530

    [DOCS] Add indexing examples and info for 1.0 (#12409)
    
    * [DOCS] Add indexing examples and info for 1.0
    
    * Address review comments
    
    * Add concurrency details
    
    * Address review comments
    
    * Fix build
    
    ---------
    
    Co-authored-by: Lokesh Jain <[email protected]>
---
 website/docs/metadata_indexing.md | 63 ++++++++++++++++++++++++---
 website/docs/quick-start-guide.md | 66 ++++++++++++++++++++++++++++
 website/docs/sql_ddl.md           | 92 ++++++++++++++++++++++++++++++++++++++-
 website/docs/sql_queries.md       |  9 ++--
 4 files changed, 217 insertions(+), 13 deletions(-)

diff --git a/website/docs/metadata_indexing.md 
b/website/docs/metadata_indexing.md
index 59c95400104..e6578275a69 100644
--- a/website/docs/metadata_indexing.md
+++ b/website/docs/metadata_indexing.md
@@ -10,18 +10,69 @@ The [pluggable indexing 
subsystem](https://www.onehouse.ai/blog/introducing-mult
 of Hudi depends on the metadata table. Different types of index, from `files` 
index for locating records efficiently
 to `column_stats` index for data skipping, are part of the metadata table. A 
fundamental tradeoff in any data system
 that supports indices is to balance the write throughput with index updates. A 
brute-force way is to lock out the writes
-while indexing. However, very large tables can take hours to index. This is 
where Hudi's novel asynchronous metadata
-indexing comes into play.
+while indexing. Hudi supports index creation using SQL, Datasource as well as 
async indexing. However, very large tables 
+can take hours to index. This is where Hudi's novel asynchronous metadata 
indexing comes into play. Indexes in Hudi are
+created in two phases and uses a mix of optimistic concurrency control and 
log-based concurrency control models. The two
+phase approach ensures that the other writers are unblocked.
 
-We can now create different metadata indices, including `files`, 
`bloom_filters`, `column_stats` and `record_index` asynchronously in
-Hudi, which are then used by readers and writers to improve performance. Being 
able to index without blocking writing
+- Scheduling - This is the first phase which schedules an indexing plan and is 
protected by a lock. Indexing plan considers all the completed commits upto 
indexing instant.
+- Execution - This phase creates the index files as mentioned in the index 
plan. At the end of the phase Hudi ensures the completed commits after indexing 
instant used already created index plan to add corresponding index metadata. 
This check is protected by a metadata table lock and in case of failures 
indexing is aborted.
+
+We can now create different metadata indices, including `files`, 
`bloom_filters`, `column_stats`, `partition_stats` and `record_index` 
+asynchronously in Hudi, which are then used by readers and writers to improve 
performance. Being able to index without blocking writing
 has two benefits,
 
 - improved write latency
 - reduced resource wastage due to contention between writing and indexing.
 
-In this document, we will learn how to setup asynchronous metadata indexing. 
To learn more about the design of this
-feature, please check out [this 
blog](https://www.onehouse.ai/blog/asynchronous-indexing-using-hudi).
+In this document, we will learn how to create indexes using SQL, Datasource 
and how to setup asynchronous metadata indexing. 
+To learn more about the design of asynchronous indexing feature, please check 
out [this blog](https://www.onehouse.ai/blog/asynchronous-indexing-using-hudi).
+
+## Index Creation Using SQL
+
+Currently indexes like secondary index, expression index and record index can 
be created using SQL create index command.
+For more information on these indexes please refer [metadata 
section](https://hudi.apache.org/docs/metadata/#metadata-table-indices)
+
+**Examples**
+```sql
+-- Create record index on primary key - uuid
+CREATE INDEX record_index ON hudi_indexed_table (uuid);
+
+-- Create secondary index on rider column.
+CREATE INDEX idx_rider ON hudi_indexed_table (rider);
+
+-- Create expression index by performing transformation on driver and city 
column 
+-- The index is created on the transformed column. Here column stats index is 
created on ts column
+-- and bloom filters index is created on city column.
+CREATE INDEX idx_column_driver ON hudi_indexed_table USING column_stats(rider) 
OPTIONS(expr='upper');
+CREATE INDEX idx_bloom_city ON hudi_indexed_table USING bloom_filters(city) 
OPTIONS(expr='identity');
+```
+
+For more information on index creation using SQL refer [SQL 
DDL](https://hudi.apache.org/docs/next/sql_ddl#create-index) 
+
+## Index Creation Using Datasource
+
+Indexes like `bloom_filters`, `column_stats`, `partition_stats` and 
`record_index` can be created using Datasource. 
+Below we list the various configs which are needed to create the indexes 
mentioned.
+
+```sql
+-- [Required Configs] Partition stats
+hoodie.metadata.index.partition.stats.enable=true
+hoodie.metadata.index.column.stats.enable=true
+-- [Optional Configs] - list of columns to index on. By default all columns 
are indexed
+hoodie.metadata.index.column.stats.column.list=col1,col2,...
+
+-- [Required Configs] Column stats
+hoodie.metadata.index.column.stats.enable=true
+-- [Optional Configs] - list of columns to index on. By default all columns 
are indexed
+hoodie.metadata.index.column.stats.column.list=col1,col2,...
+
+-- [Required Configs] Record Level Index
+hoodie.metadata.record.index.enable=true
+
+-- [Required Configs] Bloom filter Index
+hoodie.metadata.index.bloom.filter.enable=true
+```
 
 ## Setup Async Indexing
 
diff --git a/website/docs/quick-start-guide.md 
b/website/docs/quick-start-guide.md
index 0ff3225da1c..ebc1c6271f2 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -415,6 +415,72 @@ spark.sql("SELECT _hoodie_commit_time, _hoodie_record_key, 
_hoodie_partition_pat
 </Tabs
 >
 
+## Index data {#indexing}
+
+<Tabs
+groupId="programming-language"
+defaultValue="sparksql"
+values={[
+{ label: 'Spark SQL', value: 'sparksql', },
+]}
+>
+
+<TabItem value="sparksql">
+
+```sql
+-- Create a table with primary key
+CREATE TABLE hudi_indexed_table (
+    ts BIGINT,
+    uuid STRING,
+    rider STRING,
+    driver STRING,
+    fare DOUBLE,
+    city STRING
+) USING HUDI
+options(
+    primaryKey ='uuid',
+    hoodie.datasource.write.payload.class = 
"org.apache.hudi.common.model.OverwriteWithLatestAvroPayload"
+)
+PARTITIONED BY (city);
+
+INSERT INTO hudi_indexed_table
+VALUES
+(1695159649,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
+(1695091554,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70 
,'san_francisco'),
+(1695046462,'9909a8b1-2d15-4d3d-8ec9-efc48c536a00','rider-D','driver-L',33.90 
,'san_francisco'),
+(1695332066,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'san_francisco'),
+(1695516137,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'sao_paulo'
    ),
+(1695376420,'7a84095f-737f-40bc-b62f-6b69664712d2','rider-G','driver-Q',43.40 
,'sao_paulo'    ),
+(1695173887,'3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04','rider-I','driver-S',41.06 
,'chennai'      ),
+(1695115999,'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa','rider-J','driver-T',17.85,'chennai');
+
+-- Create bloom filter expression index on city column
+CREATE INDEX idx_bloom_city ON hudi_indexed_table USING bloom_filters(city) 
OPTIONS(expr='identity');
+-- It would show bloom filter expression index
+SHOW INDEXES FROM hudi_indexed_table;
+-- Query on city column would prune the data using the idx_bloom_city index
+SELECT uuid, rider FROM hudi_indexed_table WHERE city = 'san_francisco';
+
+-- Create column stat expression index on ts column
+CREATE INDEX idx_column_driver ON hudi_indexed_table USING column_stats(rider) 
OPTIONS(expr='upper');
+-- Shows both expression indexes
+SHOW INDEXES FROM hudi_indexed_table;
+-- Query on ts column would prune the data using the idx_column_ts index
+SELECT * FROM hudi_indexed_table WHERE upper(driver) = 'DRIVER-S';
+
+-- Create secondary index on rider column
+CREATE INDEX record_index ON hudi_indexed_table (uuid);
+CREATE INDEX idx_rider ON hudi_indexed_table (rider);
+-- Expression index and secondary index should show up
+SHOW INDEXES FROM hudi_indexed_table;
+-- Query on rider column would leverage the secondary index idx_rider
+SELECT * FROM hudi_indexed_table WHERE rider = 'rider-E';
+```
+</TabItem>
+
+</Tabs
+>
+
 ## Update data {#upserts}
 
 Hudi tables can be updated by streaming in a DataFrame or using a standard 
UPDATE statement.
diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md
index 08d8380afaf..30f21e61bff 100644
--- a/website/docs/sql_ddl.md
+++ b/website/docs/sql_ddl.md
@@ -190,9 +190,75 @@ TBLPROPERTIES (
 AS SELECT * FROM parquet_table;
 ```
 
-### Create Index (Experimental)
+### Create Index
 
-Hudi supports creating and dropping indexes, including functional indexes, on 
a table.
+Hudi supports creating and dropping indexes, including functional indexes, on 
a table. For more information on different
+type of indexes please refer [metadata 
section](https://hudi.apache.org/docs/metadata/#metadata-table-indices). 
Secondary 
+index, expression index and record index can be created using SQL create index 
command.
+
+**Syntax**
+```sql
+-- Record index syntax
+CREATE INDEX indexName ON tableIdentifier (primaryKey1 [, primayKey2 ...]);
+
+-- Secondary Index Syntax
+CREATE INDEX indexName ON tableIdentifier (nonPrimaryKey);
+
+-- Expression Index Syntax
+CREATE INDEX indexName ON tableIdentifier USING column_stats(col) 
OPTIONS(expr='expr_val', format='format_val');
+CREATE INDEX indexName ON tableIdentifier USING bloom_filters(col) 
OPTIONS(expr='expr_val');
+```
+
+**Examples**
+```sql
+-- Create a table with primary key
+CREATE TABLE hudi_indexed_table (
+    ts BIGINT,
+    uuid STRING,
+    rider STRING,
+    driver STRING,
+    fare DOUBLE,
+    city STRING
+) USING HUDI
+options(
+    primaryKey ='uuid',
+    hoodie.datasource.write.payload.class = 
"org.apache.hudi.common.model.OverwriteWithLatestAvroPayload"
+)
+PARTITIONED BY (city);
+
+INSERT INTO hudi_indexed_table
+VALUES
+(1695159649,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
+(1695091554,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70 
,'san_francisco'),
+(1695046462,'9909a8b1-2d15-4d3d-8ec9-efc48c536a00','rider-D','driver-L',33.90 
,'san_francisco'),
+(1695332066,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'san_francisco'),
+(1695516137,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'sao_paulo'
    ),
+(1695376420,'7a84095f-737f-40bc-b62f-6b69664712d2','rider-G','driver-Q',43.40 
,'sao_paulo'    ),
+(1695173887,'3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04','rider-I','driver-S',41.06 
,'chennai'      ),
+(1695115999,'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa','rider-J','driver-T',17.85,'chennai');
+
+-- Create bloom filter expression index on city column
+CREATE INDEX idx_bloom_city ON hudi_indexed_table USING bloom_filters(city) 
OPTIONS(expr='identity');
+-- It would show bloom filter expression index
+SHOW INDEXES FROM hudi_indexed_table;
+-- Query on city column would prune the data using the idx_bloom_city index
+SELECT uuid, rider FROM hudi_indexed_table WHERE city = 'san_francisco';
+
+-- Create column stat expression index on ts column
+CREATE INDEX idx_column_driver ON hudi_indexed_table USING column_stats(rider) 
OPTIONS(expr='upper');
+-- Shows both expression indexes
+SHOW INDEXES FROM hudi_indexed_table;
+-- Query on ts column would prune the data using the idx_column_ts index
+SELECT * FROM hudi_indexed_table WHERE upper(driver) = 'DRIVER-S';
+
+-- Create secondary index on rider column
+CREATE INDEX record_index ON hudi_indexed_table (uuid);
+CREATE INDEX idx_rider ON hudi_indexed_table (rider);
+-- Expression index and secondary index should show up
+SHOW INDEXES FROM hudi_indexed_table;
+-- Query on rider column would leverage the secondary index idx_rider
+SELECT * FROM hudi_indexed_table WHERE rider = 'rider-E';
+```
 
 :::note
 Creating indexes through SQL is in preview in version 1.0.0-beta only. It will 
be generally available in version 1.0.0.
@@ -582,6 +648,28 @@ SHOW PARTITIONS hudi_table;
 --Drop partition：
 ALTER TABLE hudi_table DROP PARTITION (dt='2021-12-09', hh='10');
 ```
+
+### Show and drop index
+
+**Syntax**
+
+```sql
+-- Show Indexes
+SHOW INDEXES FROM tableIdentifier;
+
+-- Drop partition
+DROP INDEX indexIdentifier ON tableIdentifier;
+```
+
+**Examples**
+```sql
+-- Show indexes
+SHOW INDEXES FROM hudi_indexed_table;
+
+-- Drop Index
+DROP INDEX record_index ON hudi_indexed_table;
+```
+
 ### Show create table
 
 **Syntax**
diff --git a/website/docs/sql_queries.md b/website/docs/sql_queries.md
index 2180b40a48d..350a372bf35 100644
--- a/website/docs/sql_queries.md
+++ b/website/docs/sql_queries.md
@@ -16,14 +16,13 @@ The Spark [quickstart](/docs/quick-start-guide) provides a 
good overview of how
 
 ### Snapshot Query
 Snapshot queries are the most common query type for Hudi tables. Spark SQL 
supports snapshot queries on both COPY_ON_WRITE and MERGE_ON_READ tables.
-Using session properties, you can specify various options around data skipping 
and indexing to optimize query performance, as shown below.
+Using session properties, you can specify options around indexing to optimize 
query performance, as shown below.
 
 ```sql
--- You can turn on any relevant options for data skipping and indexing. 
--- for e.g. the following turns on data skipping based on column stats
-SET hoodie.enable.data.skipping=true;
+-- You can turn on relevant options for indexing. 
+
+-- Turn on use of column stat index, to perform range queries.
 SET hoodie.metadata.column.stats.enable=true;
-SET hoodie.metadata.enable=true;
 SELECT * FROM hudi_table
 WHERE price > 1.0 and price < 10.0

(hudi) branch asf-site updated: [DOCS] Add indexing examples and info for 1.0 (#12409)

Reply via email to