This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 7a5ca87b4e6 [DOCS] Update secondary index in tech specs (#12508)
7a5ca87b4e6 is described below
commit 7a5ca87b4e64c6eefdc1e207b24852daa314f4b2
Author: Sagar Sumit <[email protected]>
AuthorDate: Thu Dec 19 07:56:45 2024 +0530
[DOCS] Update secondary index in tech specs (#12508)
---
website/src/pages/tech-specs-1point0.md | 49 +++++++++++++++++++++++++++------
1 file changed, 41 insertions(+), 8 deletions(-)
diff --git a/website/src/pages/tech-specs-1point0.md
b/website/src/pages/tech-specs-1point0.md
index e6ad40f3a60..917c2f55298 100644
--- a/website/src/pages/tech-specs-1point0.md
+++ b/website/src/pages/tech-specs-1point0.md
@@ -413,6 +413,38 @@ The record index is stored in Hudi metadata table under
the partition `record_in
| fileId | A string that represents fileId of the location where
record belongs to. When the encoding is 1, fileID is stored in raw string
format.
|
| instantTime | A long that represents epoch time in millisecond
representing the commit time at which record was added.
|
+### Secondary Index
+
+Just like databases, secondary index is a way to accelerate the queries by
columns other than the record (primary) keys.
+Hudi supports near-standard [SQL syntax](/docs/sql_ddl#create-index) for
creating/dropping indexes on different columns
+via Spark SQL, along with an asynchronous indexing table service to build
indexes without interrupting the writers.
+
+Secondary index definition is serialized to JSON format and saved at a path
specified by `hoodie.table.index.defs.path`.
+The index itself is stored in Hudi metadata table under the partition
`secondary_index_<index_name>`. As usual the index
+record is a key-value, however the encoding is slightly more nuanced.
+
+**Key** is constructed by combining the values of **secondary column** and
**primary key column** separated by a delimiter.
+The key is encoded in a format that ensures:
+
+1. **Uniqueness**: Each key is distinct.
+2. **Safety**: Any occurrences of the delimiter or escape character within the
data itself are handled correctly to avoid ambiguity.
+3. **Efficiency**: The encoding and decoding processes are optimized for
performance while ensuring data integrity.
+
+The key format is:
+```
+<escaped-secondary-key>$<escaped-primary-key>
+
+Where:
+ - `$` is the delimiter separating the secondary key and primary key.
+ - Special characters in the secondary or primary key (`$` and `\`) are
escaped to avoid conflicts.
+```
+
+**Value** contains metadata about the record, specifically an `isDeleted` flag
indicating whether the record is valid or has been logically deleted.
+
+For example, consider a secondary index on the `city` column. The key-value
pair for a record with `city` as `Chennai` and `id` as `id1` would look like:
+```
+chennai$id1 -> {"isDeleted": false}
+```
### Expression Indexes
@@ -424,14 +456,15 @@ Index itself is stored in Hudi metadata table under the
partition `expr_index_<u
We covered different [storage layouts](#storage-layout) earlier. Functional
index aggregates stats by storage partitions and, as such, partitioning can be
absorbed into functional indexes.
From that perspective, some useful functions that can also be applied as
transforms on a field to extract and index partitions are listed below.
-| Function | Description |
-|------------|-------------------------------------|
-| `identity` | Identity function, unmodified value |
-| `year` | Year of the timestamp |
-| `month` | Month of the timestamp |
-| `day` | Day of the timestamp |
-| `hour` | Hour of the timestamp |
-| `lower` | Lower case of the string |
+| Function | Description
|
+|-----------------|-----------------------------------------------------------------------------------|
+| `identity` | Identity function, unmodified value
|
+| `year` | Year of the timestamp
|
+| `month` | Month of the timestamp
|
+| `day` | Day of the timestamp
|
+| `hour` | Hour of the timestamp
|
+| `lower` | Lower case of the string
|
+| `from_unixtime` | Convert unix epoch to a string representing the timestamp
in the specified format |
## Relational Model