(hudi) branch asf-site updated: [DOCS] Update secondary index in tech specs (#12508)

vinoth Wed, 18 Dec 2024 18:27:05 -0800

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 7a5ca87b4e6 [DOCS] Update secondary index in tech specs (#12508)
7a5ca87b4e6 is described below

commit 7a5ca87b4e64c6eefdc1e207b24852daa314f4b2
Author: Sagar Sumit <[email protected]>
AuthorDate: Thu Dec 19 07:56:45 2024 +0530

    [DOCS] Update secondary index in tech specs (#12508)
---
 website/src/pages/tech-specs-1point0.md | 49 +++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 8 deletions(-)

diff --git a/website/src/pages/tech-specs-1point0.md b/website/src/pages/tech-specs-1point0.md
index e6ad40f3a60..917c2f55298 100644
--- a/website/src/pages/tech-specs-1point0.md
+++ b/website/src/pages/tech-specs-1point0.md
@@ -413,6 +413,38 @@ The record index is stored in Hudi metadata table under 
the partition `record_in
 | fileId         | A string that represents fileId of the location where record belongs to. When the encoding is 1, fileID is stored in raw string format. |
 | instantTime    | A long that represents epoch time in millisecond representing the commit time at which record was added. |
 
+### Secondary Index
+
+Just as in databases, a secondary index accelerates queries on columns other than the record (primary) key.
+Hudi supports near-standard [SQL syntax](/docs/sql_ddl#create-index) for creating and dropping indexes on different columns
+via Spark SQL, along with an asynchronous indexing table service that builds indexes without interrupting writers.
+
+The secondary index definition is serialized to JSON and saved at the path specified by `hoodie.table.index.defs.path`.
+The index itself is stored in the Hudi metadata table under the partition `secondary_index_<index_name>`. As usual, each index
+record is a key-value pair; however, the encoding is slightly more nuanced.
+
+The **key** is constructed by combining the values of the **secondary column** and the **primary key column**, separated by a delimiter.
+The key is encoded in a format that ensures:
+
+1. **Uniqueness**: Each key is distinct.
+2. **Safety**: Any occurrences of the delimiter or escape character within the data itself are handled correctly to avoid ambiguity.
+3. **Efficiency**: The encoding and decoding processes are optimized for performance while ensuring data integrity.
+
+The key format is:
+```
+<escaped-secondary-key>$<escaped-primary-key>
+
+Where:
+  - `$` is the delimiter separating the secondary key and primary key.
+  - Special characters in the secondary or primary key (`$` and `\`) are escaped to avoid conflicts.
+```
+
+The **value** contains metadata about the record, specifically an `isDeleted` flag indicating whether the record is valid or has been logically deleted.
+
+For example, consider a secondary index on the `city` column. The key-value pair for a record with `city` as `Chennai` and `id` as `id1` would look like:
+```
+Chennai$id1 -> {"isDeleted": false}
+```
 
 ### Expression Indexes
 
@@ -424,14 +456,15 @@ Index itself is stored in Hudi metadata table under the 
partition `expr_index_<u
 We covered different [storage layouts](#storage-layout) earlier. Functional index aggregates stats by storage partitions and, as such, partitioning can be absorbed into functional indexes.
 From that perspective, some useful functions that can also be applied as transforms on a field to extract and index partitions are listed below.
 
-| Function   | Description                         |
-|------------|-------------------------------------|
-| `identity` | Identity function, unmodified value |
-| `year`     | Year of the timestamp               |
-| `month`    | Month of the timestamp              |
-| `day`      | Day of the timestamp                |
-| `hour`     | Hour of the timestamp               |
-| `lower`    | Lower case of the string            | 
+| Function        | Description                                                                       |
+|-----------------|-----------------------------------------------------------------------------------|
+| `identity`      | Identity function, unmodified value                                               |
+| `year`          | Year of the timestamp                                                             |
+| `month`         | Month of the timestamp                                                            |
+| `day`           | Day of the timestamp                                                              |
+| `hour`          | Hour of the timestamp                                                             |
+| `lower`         | Lower case of the string                                                          |
+| `from_unixtime` | Convert unix epoch to a string representing the timestamp in the specified format |
 
 ## Relational Model
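
The secondary index key format added in this commit (`<escaped-secondary-key>$<escaped-primary-key>`) can be illustrated with a small sketch. This is not Hudi's implementation: the function names `escape`, `encode_key` and `decode_key` are hypothetical, and the sketch assumes that `\` is the escape character and that escaping means prefixing each `$` or `\` with a single `\`, which the spec text implies but does not spell out.

```python
from typing import Tuple

DELIMITER = "$"   # separates secondary key from primary key (from the spec text)
ESCAPE = "\\"     # assumed escape character; the spec lists `\` as a special character


def escape(part: str) -> str:
    """Prefix every escape character and delimiter in `part` with ESCAPE."""
    # Escape the escape character first so the backslashes added for `$`
    # are not escaped a second time.
    return part.replace(ESCAPE, ESCAPE + ESCAPE).replace(DELIMITER, ESCAPE + DELIMITER)


def encode_key(secondary: str, primary: str) -> str:
    """Build <escaped-secondary-key>$<escaped-primary-key>."""
    return escape(secondary) + DELIMITER + escape(primary)


def decode_key(key: str) -> Tuple[str, str]:
    """Split an encoded key back into (secondary, primary)."""
    parts = [[]]
    i = 0
    while i < len(key):
        ch = key[i]
        if ch == ESCAPE:
            # The next character is literal data, whatever it is.
            if i + 1 >= len(key):
                raise ValueError("dangling escape character")
            parts[-1].append(key[i + 1])
            i += 2
        elif ch == DELIMITER:
            # An unescaped delimiter starts the next component.
            parts.append([])
            i += 1
        else:
            parts[-1].append(ch)
            i += 1
    if len(parts) != 2:
        raise ValueError("expected exactly one unescaped delimiter")
    return "".join(parts[0]), "".join(parts[1])
```

Under these assumptions, a value without special characters passes through unchanged, so `encode_key("Chennai", "id1")` yields `Chennai$id1`, while a `$` inside either component is escaped and therefore cannot be mistaken for the delimiter when the key is decoded.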
