codope commented on code in PR #10814:
URL: https://github.com/apache/hudi/pull/10814#discussion_r1576637313


##########
rfc/rfc-77/rfc-77.md:
##########
@@ -0,0 +1,323 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+# RFC-77: Secondary Indexes
+
+## Proposers
+
+- @bhat-vinay
+- @codope
+
+## Approvers
+ - @vinothchandar
+ - @nsivabalan
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7146
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+In this RFC, we propose implementing Secondary Indexes (SI), a new capability in Hudi's metadata table (MDT) based indexing
+system. SI are indexes defined on user-specified columns of the table. Similar to the record level index,
+SI will improve query performance when the query predicate contains secondary keys. The number of files
+that a query needs to scan can be pruned down using secondary indexes.
+
+## Background
+
+Hudi supports different indexes through its MDT. These indexes help improve query performance by
+pruning down the set of files that need to be scanned to build the result set of the query.
+
+One of the indexes supported in Hudi is the Record Level Index (RLI). RLI acts as a unique-key index and can be used to
+locate the FileGroup of a record based on its RecordKey. A query having an EQUAL or IN predicate on the RecordKey will
+see a performance boost, as the RLI can accurately give a subset of FileGroups that contain the rows matching the
+predicate.
+
+Many workloads have queries with predicates that are not based on the RecordKey. Such queries cannot use RLI for data
+skipping. Traditional databases have a notion of building indexes (called Secondary Indexes or SI) on user-specified
+columns to aid such queries. This RFC proposes implementing SI in Hudi. Users can build SI on columns that are
+frequently used as filtering columns (i.e. columns on which the query predicate is based). As with any other index,
+building and maintaining SI adds overhead on the write path, so users should choose indexed columns wisely based
+on their workload. Tools can be built to provide guidance on the usefulness of indexing a specific column, but that is
+not in the scope of this RFC.
+
+## Design and Implementation
+This section briefly discusses the goals, design, and implementation details of supporting SI in Hudi. At a high level,
+the design principles and goals are as follows:
+1. Users specify the SI to be built on a given column of a table. A given SI can be built on only one column of the table
+(i.e. composite keys are not allowed). Any number of SI can be built on a Hudi table. The indexes to be built are
+specified using regular SQL statements.
+2. Metadata of a SI will be tracked through the index metadata file under `<base_path>/.hoodie/.index` (this path can be configurable).
+3. Each SI will be a partition inside the Hudi MDT. Index data will not be materialized with the base table's data files.
+4. The logical plan of a query will be used to efficiently filter FileGroups based on the query predicate and the available
+indexes.
+
+### SQL
+SI can be created using the regular `CREATE INDEX` SQL statement.
+```
+-- PROPOSED SYNTAX WITH `secondary_index` as the index type --
+CREATE INDEX [IF NOT EXISTS] index_name ON [TABLE] table_name [USING secondary_index](index_column)
+-- Examples --
+CREATE INDEX idx_city on hudi_table USING secondary_index(city)
+CREATE INDEX idx_last_name on hudi_table (last_name)
+
+-- NO CHANGE IN DROP INDEX --
+DROP INDEX idx_city;
+```
+
+`index_name` - Required and validated by the parser. `index_name` will be used to derive the name of the physical partition
+in the MDT by prefixing `secondary_index_`. If the `index_name` is `idx_city`, then the MDT partition will be
+`secondary_index_idx_city`.
+
+The `index_type` will be `secondary_index`. This will be used to distinguish SI from other Functional Indexes.
+
+### Secondary Index Metadata
+Secondary index metadata will be managed the same way as Functional Index metadata. Since SI does not have any function
+to be applied on each row, the `function_name` will be NULL.
+
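+Since SI metadata piggybacks on the Functional Index metadata, a hedged sketch of what one index definition entry could carry is shown below; the Java record and its field names are illustrative, not the final schema:
+
+```java
+// Hypothetical shape of one entry in the index metadata file; field names are
+// illustrative only. For a secondary index, functionName is always null.
+record IndexDefinitionSketch(
+    String indexName,      // e.g. "idx_city"
+    String indexType,      // "secondary_index", distinguishing SI from Functional Indexes
+    String indexedColumn,  // e.g. "city"; exactly one column, no composite keys
+    String functionName    // NULL for SI: no function is applied per row
+) {}
+```
+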
+### Index in Metadata Table (MDT)
+Each SI will be stored as a physical partition in the MDT. The partition name is derived from the `index_name` by
+prefixing `secondary_index_`. Each entry in the SI partition will be a mapping of the form
+`secondary_key -> record_key`. `secondary_key` will form the "record key" for the records of the SI partition. Note that
+an important design consideration here is that users may choose to build SI on a non-unique column of the table.
+
+#### Index Initialisation
+The initial build of the secondary index will scan all file slices (of the base table) to extract
+`secondary-key -> record-key` tuples and write them into the secondary index partition in the metadata table.
+This is similar to how RLI is initialised.
+
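+Below is a minimal sketch of the initial build, assuming the rows of all file slices have already been read; a row is modeled as a plain `Map`, and real Hudi types (file slices, HFile writers) are elided:
+
+```java
+import java.util.*;
+import java.util.stream.*;
+
+// Illustrative only: extract secondary-key -> record-key tuples from the rows
+// of all file slices and sort them by secondary key, since MDT base files
+// (HFile) are sorted by the record key of the partition.
+class SecondaryIndexBootstrapSketch {
+  record IndexEntry(String secondaryKey, String recordKey) {}
+
+  static List<IndexEntry> build(List<Map<String, String>> baseTableRows,
+                                String indexedColumn, String recordKeyColumn) {
+    return baseTableRows.stream()
+        .map(row -> new IndexEntry(row.get(indexedColumn), row.get(recordKeyColumn)))
+        .sorted(Comparator.comparing(IndexEntry::secondaryKey))
+        .collect(Collectors.toList());
+  }
+}
+```
+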
+#### Index Maintenance
+The index needs to be updated on inserts, updates, and deletes to the base table. Considering that secondary keys in
+the base table could be non-unique, this process differs significantly from RLI maintenance.
+
+##### Inserts (on the base table)
+A newly inserted row's record-key and secondary-key are required to build the secondary index entry. The record key is
+already stored in the `WriteStatus`, and the commit metadata has the files touched by that commit. `WriteStatus` will be enhanced to store the secondary-key values (for all
+those columns on which a secondary index is defined). The metadata writer will extract this information and write it out
+to the secondary index partition. [1]
+
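+A hedged sketch of the proposed `WriteStatus` enhancement is shown below; the class shape and method names are hypothetical, not the actual Hudi API:
+
+```java
+import java.util.*;
+
+// Alongside the record key it already tracks, the enhanced WriteStatus also
+// buffers the values of every column that has a secondary index defined, so
+// the metadata writer can emit SI entries without re-reading data files.
+class WriteStatusSketch {
+  // record key -> (indexed column name -> secondary key value)
+  private final Map<String, Map<String, String>> secondaryKeys = new HashMap<>();
+
+  void markWritten(String recordKey, Map<String, String> row, Set<String> indexedColumns) {
+    Map<String, String> values = new HashMap<>();
+    for (String col : indexedColumns) {
+      values.put(col, row.get(col));
+    }
+    secondaryKeys.put(recordKey, values);
+  }
+
+  // Read back by the metadata writer to build the SI partition records.
+  Map<String, Map<String, String>> getSecondaryKeys() {
+    return Collections.unmodifiableMap(secondaryKeys);
+  }
+}
+```
+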
+##### Updates (on the base table)
+Similar to inserts, the `secondary-key -> record-key` tuples are extracted from the `WriteStatus`. However, additional
+information needs to be emitted to aid compaction and merging (of the MDT partition hosting the SI).
+
+Let's take an example where a row's `secondary-key` column is updated, changing its index entry from
+`old-secondary-key -> record-key` to `new-secondary-key -> record-key`. Note that as the secondary keys are updated,
+the `new-secondary-key` could get mapped to a different FileGroup (in the MDT partition hosting the SI) compared to
+`old-secondary-key`. There needs to be a mechanism to clean up the `old-secondary-key -> record-key` mapping (from the
+MDT partition hosting the SI). The proposal is to emit a tombstone entry `old-secondary-key -> (record-key, deleted)`. This
+is then followed by writing the `new-secondary-key -> record-key` entry. The log record reader (merger) and compactor should
+correctly use the tombstone marker to remove stale entries from the old FileGroup. Note that since the mapping of a
+record (of the MDT) to a FileGroup is based on the `secondary-key`, it is guaranteed that the
+`old-secondary-key -> (record-key, deleted)` tombstone record will get mapped to the FileGroup containing the
+`old-secondary-key -> record-key` entry.
+
+Another key observation here is that the `old-secondary-key` is required to construct the tombstone record. Unlike some other
+data systems, Hudi does not read the old image of a row on updates until a merge is executed. It detects that a row is being updated by simply
+reading the index and appending the updates to log files. Hence, there needs to be a mechanism to extract the `old-secondary-key`. We propose
+that the `old-secondary-key` be extracted by scanning the MDT partition (hosting the SI) and doing a reverse lookup based
+on the `record-key` of the row being updated. It should be noted that this is going to be an expensive operation as the
+base table grows in size (which inherently means that the SI will grow in size) in terms of number of rows. One way to
+optimize this is to build a reverse mapping `record-key -> secondary-key` in a different MDT partition. This is
+left as a TBD (as of this writing).
+
+##### Deletes (on the base table)
+This is handled the same way updates are handled. Only the `old-secondary-key -> (record-key, deleted)` tuple is emitted
+as a tombstone record to the FileGroup containing the `old-secondary-key -> record-key` entry.
+
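+The records emitted by the update and delete flows above can be sketched as follows; the entry type is a stand-in for whatever the metadata writer actually produces:
+
+```java
+import java.util.*;
+
+// Illustrative emission logic for updates and deletes on the base table.
+class SecondaryIndexMaintenanceSketch {
+  record IndexEntry(String secondaryKey, String recordKey, boolean deleted) {}
+
+  // Update: tombstone the old mapping, then write the new one. The tombstone
+  // routes (by secondary key) to the FileGroup holding the old entry.
+  static List<IndexEntry> onUpdate(String recordKey, String oldSecondaryKey, String newSecondaryKey) {
+    return List.of(
+        new IndexEntry(oldSecondaryKey, recordKey, true),
+        new IndexEntry(newSecondaryKey, recordKey, false));
+  }
+
+  // Delete: only the tombstone is emitted.
+  static List<IndexEntry> onDelete(String recordKey, String oldSecondaryKey) {
+    return List.of(new IndexEntry(oldSecondaryKey, recordKey, true));
+  }
+}
+```
+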
+### Query Execution
+The secondary index will be used to prune the candidate files (to scan) if the query has an "EQUAL TO" or "IN" predicate
+on `secondary-key` columns. The initial implementation will follow the same logic (and constraints) used by RLI-based
+data skipping (like single-column keys, rule-based index selection, etc.). In the first step, we scan the SI partition
+(in the MDT) and filter all entries that match the secondary keys specified in the query predicate. We collect all the
+record keys (after filtering). Subsequently, these record keys are used to look up the record locations using RLI.
+
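+A sketch of this two-step lookup is shown below, with both indexes modeled as plain in-memory maps (the real implementation scans MDT partitions):
+
+```java
+import java.util.*;
+import java.util.stream.*;
+
+// Step 1: SI maps predicate values to record keys.
+// Step 2: RLI maps those record keys to candidate FileGroups.
+class QueryPruningSketch {
+  static Set<String> candidateFileGroups(
+      Map<String, List<String>> secondaryIndex,  // secondary key -> record keys
+      Map<String, String> recordLevelIndex,      // record key -> FileGroup id
+      Set<String> predicateValues) {             // e.g. city IN ('sfo', 'austin')
+    return predicateValues.stream()
+        .flatMap(sk -> secondaryIndex.getOrDefault(sk, List.of()).stream())
+        .map(recordLevelIndex::get)
+        .filter(Objects::nonNull)
+        .collect(Collectors.toSet());
+  }
+}
+```
+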
+### Merging of secondary index records (on the read path or during compaction)
+Record mergers are used in MOR tables by the readers to merge log files and base files. Similarity of the 'keys' in
+records is used to identify candidate records that need to be merged. The 'key' for an RLI entry is the
+`record-key`, and by definition it is unique. But the keys for secondary index entries are the `secondary-keys`, which
+can be non-unique. Hence, the merging of SI entries will make use of the payload, i.e. the `record-key` in the
+`secondary-key -> record-key` tuple, to identify candidate records that need to be merged. It will also be guided by the
+tombstone records emitted during updates or deletes. An example is provided here on how the different log files are merged and how the merged log
+records are finally merged with the base file to obtain the merged records (of the MDT partition hosting the SI).
+
+Consider the following table, `trips_table`. Note that this table is only used to illustrate the merging logic and is not
+to be used as a definitive table for other considerations (for example, the performance aspect of some of the algorithms
+chosen will depend on the cardinality of the `record-key` and `secondary-key` columns).
+
+| Column Name | Column Type |
+|-------------|-------------|
+| uuid        | string      |
+| ts          | long        |
+| rider       | string      |
+| city        | string      |
+
+`uuid` is the `record-key` column and a SI is built on the `city` column. A few details to be noted:
+1. Base files in the MDT are sorted (HFile) by the 'key' of the records stored in that MDT partition. For the MDT
+partition hosting the SI, records are sorted by `secondary-key`.
+2. A base file will never contain a tombstone record (record mergers never write tombstone or deleted records
+into base files).
+
+Base File Entries:
+```
+chennai -> c8abbe79-8d89-47ea-b4ce-4d224bae5bfa
+los-angeles -> 9909a8b1-2d15-4d3d-8ec9-efc48c536a01
+los-angeles -> 9809a8b1-2d15-4d3d-8ec9-efc48c536a01
+sfo -> 334e26e9-8355-45cc-97c6-c31daf0df330
+sfo -> 334e26e9-8355-45cc-97c6-c31daf0df329
+```
+
+Log File Entries:
+```
+sfo -> (334e26e9-8355-45cc-97c6-c31daf0df329, deleted)
+chennai -> e3cf430c-889d-4015-bc98-59bdce1e530c
+los-angeles -> (9809a8b1-2d15-4d3d-8ec9-efc48c536a01, deleted)
+austin -> 9809a8b1-2d15-4d3d-8ec9-efc48c536a01
+```
+
+Merged Records (sorted):
+``` 
+austin -> 9809a8b1-2d15-4d3d-8ec9-efc48c536a01
+chennai -> e3cf430c-889d-4015-bc98-59bdce1e530c
+chennai -> c8abbe79-8d89-47ea-b4ce-4d224bae5bfa
+los-angeles -> 9909a8b1-2d15-4d3d-8ec9-efc48c536a01
+sfo -> 334e26e9-8355-45cc-97c6-c31daf0df330
+```
+
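+The merge semantics of this example can be sketched as below; this is in-memory only, for illustration, while the production merger must also meet the requirements discussed later in this section:
+
+```java
+import java.util.*;
+
+// Entries are grouped by the full (secondary-key, record-key) tuple; a log
+// tombstone removes the matching base entry, and survivors stay sorted by
+// secondary key as they would be in the merged HFile.
+class SecondaryIndexMergeSketch {
+  record Entry(String secondaryKey, String recordKey, boolean deleted) {}
+
+  static List<Entry> merge(List<Entry> baseEntries, List<Entry> logEntries) {
+    Map<List<String>, Entry> merged = new LinkedHashMap<>();
+    for (Entry e : baseEntries) {
+      merged.put(List.of(e.secondaryKey(), e.recordKey()), e);
+    }
+    for (Entry e : logEntries) {
+      List<String> tuple = List.of(e.secondaryKey(), e.recordKey());
+      if (e.deleted()) {
+        merged.remove(tuple);   // tombstone removes the stale mapping
+      } else {
+        merged.put(tuple, e);   // fresh entry: insert or overwrite
+      }
+    }
+    List<Entry> result = new ArrayList<>(merged.values());
+    result.sort(Comparator.comparing(Entry::secondaryKey));
+    return result;
+  }
+}
+```
+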
+An important detail to be noted here is that the grouping for merging depends on the `secondary-key, record-key` tuple (`city, uuid`
+in this example). This is radically different from any other existing partition in Hudi (MDT or base tables). Hence,
+support needs to be added to merge records from the MDT partition hosting the SI (with a generalized approach that can be used for
+any future non-unique record key based dataset).
+
+The data structure built for holding the merged log records should enable searching similar keys efficiently, with the
+following considerations:
+1. Should allow spilling to disk (to prevent OOM). A pure in-memory data structure will not be production-ready, but
+can be used for a POC.
+2. Should allow for searching (and retrieving) based on two keys as efficiently as possible. The first search will be based
+on the `secondary-key` and the second search will be based on the `record-key`. Hence, the single-level Map used by
+`HoodieMergedLogRecordScanner` will not work.
+3. Should allow for efficient removal of records (deleted records need to be removed). Hence, any data structure that
+uses a flat array will not be efficient.
+4. Should allow for efficient insertion of records (for inserting merged records and for buffering fresh records).
+
+The [initial POC](https://github.com/apache/hudi/pull/10625) makes use of in-memory nested maps, with the first level keyed by `secondary-key`
+and the second level keyed by `record-key`. However, the final design should allow spilling to disk.
+
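+A sketch of that nested-map layout follows; plain `HashMap`s stand in for what would need to be a spillable backing store:
+
+```java
+import java.util.*;
+
+// First level keyed by secondary key, second level keyed by record key.
+// Satisfies the two-key lookup (2), point removal (3), and insertion (4)
+// requirements above, but not spilling (1).
+class NestedIndexBufferSketch {
+  private final Map<String, Map<String, String>> buffer = new HashMap<>();
+
+  void upsert(String secondaryKey, String recordKey, String payload) {
+    buffer.computeIfAbsent(secondaryKey, k -> new HashMap<>()).put(recordKey, payload);
+  }
+
+  void delete(String secondaryKey, String recordKey) {
+    Map<String, String> inner = buffer.get(secondaryKey);
+    if (inner != null) {
+      inner.remove(recordKey);
+      if (inner.isEmpty()) {
+        buffer.remove(secondaryKey);
+      }
+    }
+  }
+
+  Optional<String> lookup(String secondaryKey, String recordKey) {
+    return Optional.ofNullable(buffer.get(secondaryKey)).map(m -> m.get(recordKey));
+  }
+}
+```
+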
+Considering the above requirements, the proposal is to extend `HoodieKey` to support secondary keys. We will make `HoodieKey` generic and create a subclass `HoodieSecondaryKey` where the key can be a generic type.

Review Comment:
   Not doing anymore. Introduced a new hierarchy as suggested.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
