codope commented on code in PR #10087: URL: https://github.com/apache/hudi/pull/10087#discussion_r1393221202
########## website/docs/sql_ddl.md: ########## @@ -178,6 +178,145 @@ TBLPROPERTIES ( AS SELECT * FROM parquet_table; ``` +### Create Index + +Hudi supports creating and dropping indexes, including functional indexes, on a table. + +:::note +Creating indexes through SQL is in preview in version 1.0.0-beta only. It will be generally available in version 1.0.0. +Please report any issues you find either via [GitHub issues](https://github.com/apache/hudi/issues) or creating a [JIRA](https://issues.apache.org/jira/projects/HUDI/issues). +::: + +```sql +-- Create Index +CREATE INDEX [IF NOT EXISTS] index_name ON [TABLE] table_name +[USING index_type] +(column_name1 [OPTIONS(key1=value1, key2=value2, ...)], column_name2 [OPTIONS(key1=value1, key2=value2, ...)], ...) +[OPTIONS (key1=value1, key2=value2, ...)] + +-- Drop Index +DROP INDEX [IF EXISTS] index_name ON [TABLE] table_name +``` + +- `index_name` is the name of the index to be created or dropped. +- `table_name` is the name of the table on which the index is created or dropped. +- `index_type` is the type of the index to be created. Currently, only `files`, `column_stats` and `bloom_filters` is supported. +- `column_name` is the name of the column on which the index is created. +- Both index and column on which the index is created can be qualified with some options in the form of key-value pairs. + We will see this with an example of functional index below. + +#### Create Functional Index + +A [functional index](https://github.com/apache/hudi/blob/00ece7bce0a4a8d0019721a28049723821e01842/rfc/rfc-63/rfc-63.md) +is an index on a function of a column. It is a new addition to Hudi's [multi-modal indexing](https://hudi.apache.org/blog/2022/05/17/Introducing-Multi-Modal-Index-for-the-Lakehouse-in-Apache-Hudi) +subsystem which provides faster access method and also absorb partitioning as part of the indexing system. Let us see +some examples of creating a functional index. + +```sql +-- Create a functional index on the column `ts` (unix epoch) of the table `hudi_table` using the function `from_unixtime` with the format `yyyy-MM-dd` +CREATE INDEX IF NOT EXISTS ts_datestr ON hudi_table USING column_stats(ts) OPTIONS(func='from_unixtime', format='yyyy-MM-dd'); +-- Create a functional index on the column `ts` (timestamp in yyyy-MM-dd HH:mm:ss) of the table `hudi_table` using the function `hour` +CREATE INDEX ts_hour ON hudi_table_func_index USING column_stats(ts) options(func='hour'); +``` + +Few things to note: +- The `func` option is required for creating functional index, and it should be a valid Spark SQL function. Currently, + only the functions that take a single column as input are supported. Some useful functions that are supported are listed below. + - `identity` + - `from_unixtime` + - `date_format` + - `to_date` + - `to_timestamp` + - `year` + - `month` + - `day` + - `hour` + - `lower` + - `upper` + - `substring` + - `regexp_extract` + - `regexp_replace` + - `concat` + - `length` +- Please check the syntax for the above functions in + the [Spark SQL documentation](https://spark.apache.org/docs/latest/sql-ref-functions.html) and provide the options + accordingly. For example, the `format` option is required for `from_unixtime` function. +- UDFs are not supported. + +<details> + <summary>Example of creating and using functional index</summary> + +```sql +DROP TABLE IF EXISTS hudi_table_func_index; Review Comment: Actually `hudi_table_func_index` is the name of the table and not the index. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
