This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new ff88dfe5f53 [HUDI-7146] [RFC-77] RFC for secondary index (#10814)
ff88dfe5f53 is described below
commit ff88dfe5f5324f8999e4b7665eb86b7e634b3ce6
Author: bhat-vinay <[email protected]>
AuthorDate: Fri Dec 13 01:13:46 2024 +0530
[HUDI-7146] [RFC-77] RFC for secondary index (#10814)
* [HUDI-7146] [RFC-77] RFC for secondary index
Abstract:
In this RFC, we propose implementing Secondary Indexes (SI), a new
capability in Hudi's metadata table (MDT) based indexing system. SI are
indexes defined on user-specified columns of the table. Similar to the
record level index, SI will improve query performance when the query
predicate contains secondary keys. The number of files that a query
needs to scan can be pruned down using secondary indexes.
* Some corrections
* Update RFC with key changes
* Address comments
---------
Co-authored-by: Vinaykumar Bhat <[email protected]>
Co-authored-by: Sagar Sumit <[email protected]>
---
rfc/rfc-77/rfc-77.md | 345 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 345 insertions(+)
diff --git a/rfc/rfc-77/rfc-77.md b/rfc/rfc-77/rfc-77.md
new file mode 100644
index 00000000000..dd488033ecb
--- /dev/null
+++ b/rfc/rfc-77/rfc-77.md
@@ -0,0 +1,345 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# RFC-77: Secondary Indexes
+
+## Proposers
+
+- @bhat-vinay
+- @codope
+
+## Approvers
+ - @vinothchandar
+ - @nsivabalan
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7146
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+In this RFC, we propose implementing Secondary Indexes (SI), a new capability in Hudi's metadata table (MDT) based
+indexing system. SI are indexes defined on user-specified columns of the table. Similar to the record level index,
+SI will improve query performance when the query predicate contains secondary keys. The number of files that a query
+needs to scan can be pruned down using secondary indexes.
+
+## Background
+
+Hudi supports different indexes through its MDT. These indexes help improve query performance by pruning the set of
+files that need to be scanned to build the result set of a query.
+
+One of the supported indexes in Hudi is the Record Level Index (RLI). RLI acts as a unique-key index and can be used
+to locate the FileGroup of a record based on its RecordKey. A query having an EQUAL or IN predicate on the RecordKey
+gets a performance boost, as the RLI can accurately give the subset of FileGroups that contain the rows matching the
+predicate.
+
+Many workloads have queries with predicates that are not based on the RecordKey. Such queries cannot use RLI for data
+skipping. Traditional databases have a notion of building indexes (called Secondary Indexes or SI) on user-specified
+columns to aid such queries. This RFC proposes implementing SI in Hudi. Users can build SI on columns that are
+frequently used as filtering columns (i.e., columns on which the query predicate is based). As with any other index,
+building and maintaining SI adds overhead on the write path, so users should choose the columns to index wisely based
+on their workload. Tools can be built to provide guidance on the usefulness of indexing a specific column, but such
+tooling is not in the scope of this RFC.
+
+## Design and Implementation
+This section briefly discusses the goals, design, and implementation details of supporting SI in Hudi. At a high
+level, the design principles and goals are as follows:
+1. Users specify the SI to be built on a given column of a table. A given SI can be built on only one column of the
+table (i.e., composite keys are not allowed). Any number of SI can be built on a Hudi table. The indexes to be built
+are specified using regular SQL statements.
+2. Metadata of an SI will be tracked through the index metadata file under `<base_path>/.hoodie/.index` (this path
+can be configurable).
+3. Each SI will be a partition inside the Hudi MDT. Index data will not be materialized with the base table's data
+files.
+4. The logical plan of a query will be used to efficiently filter FileGroups based on the query predicate and the
+available indexes.
+
+### SQL
+SI can be created using the regular `CREATE INDEX` SQL statement.
+```
+-- PROPOSED SYNTAX WITH `secondary_index` as the index type --
+CREATE INDEX [IF NOT EXISTS] index_name ON [TABLE] table_name [USING secondary_index](index_column)
+-- Examples --
+CREATE INDEX idx_city ON hudi_table USING secondary_index(city)
+CREATE INDEX idx_last_name ON hudi_table (last_name)
+
+-- NO CHANGE IN DROP INDEX --
+DROP INDEX idx_city;
+```
+
+`index_name` - Required and validated by the parser. `index_name` will be used to derive the name of the physical
+partition in the MDT by prefixing `secondary_index_`. If the `index_name` is `idx_city`, then the MDT partition will
+be `secondary_index_idx_city`.
+
+The `index_type` will be `secondary_index`. This will be used to distinguish SI from other Functional Indexes.
+
+### Secondary Index Metadata
+Secondary index metadata will be managed the same way as Functional Index metadata. Since SI does not have any
+function to be applied to each row, the `function_name` will be NULL.
+
+### Index in Metadata Table (MDT)
+Each SI will be stored as a physical partition in the MDT. The partition name is derived from the `index_name` by
+prefixing `secondary_index_`. Each entry in the SI partition will be a mapping of the form
+`secondary_key -> record_key`. `secondary_key` will form the "record key" for the records of the SI partition. Note
+that an important design consideration here is that users may choose to build SI on a non-unique column of the table.
+
+#### Index Initialisation
+Initial build of the secondary index will scan all file slices (of the base table) to extract
+`secondary-key -> record-key` tuples and write them into the secondary index partition in the metadata table.
+This is similar to how RLI is initialised. A simplified sketch of this bootstrap pass is shown below.
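+
+The following is a minimal, illustrative sketch of the bootstrap pass. It models base-table rows as plain maps; the
+real implementation would iterate Hudi `FileSlice`s and write `HoodieRecord`s to the MDT, so the class and method
+names here (`SecondaryIndexBootstrapSketch`, `buildIndexEntries`) are hypothetical.
+```java
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+class SecondaryIndexBootstrapSketch {
+  // Each base-table row is modelled as a column-name -> value map for illustration.
+  static List<String[]> buildIndexEntries(List<Map<String, String>> rows, String indexedColumn) {
+    List<String[]> entries = new ArrayList<>();
+    for (Map<String, String> row : rows) {
+      String secondaryKey = row.get(indexedColumn);
+      String recordKey = row.get("_hoodie_record_key");
+      // secondary-key -> record-key tuple; the secondary key acts as the
+      // record key of the SI partition in the MDT.
+      entries.add(new String[] {secondaryKey, recordKey});
+    }
+    return entries;
+  }
+}
+```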
+
+#### Index Maintenance
+The index needs to be updated on inserts, updates and deletes to the base table. Considering that secondary keys in
+the base table could be non-unique, this process differs significantly from RLI maintenance.
+
+##### Inserts (on the base table)
+A newly inserted row's record-key and secondary-key are required to build the secondary-index entry. The record key
+is already stored in the `WriteStatus`, and the commit metadata has the files touched by that commit. `WriteStatus`
+will be enhanced to also store the secondary-key values (for all those columns on which a secondary index is
+defined). The metadata writer will extract this information and write it out to the secondary index partition, as
+sketched below. [1]
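+
+Below is a hedged sketch of what the `WriteStatus` enhancement could look like; the class, field, and method names
+(`EnhancedWriteStatus`, `trackInsert`) are hypothetical and only meant to show the shape of the tracked data.
+```java
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+class EnhancedWriteStatus {
+  // record-key -> (indexed column -> secondary-key value), captured while
+  // the write handle processes each inserted row.
+  private final Map<String, Map<String, String>> secondaryKeys = new HashMap<>();
+
+  void trackInsert(String recordKey, Map<String, String> row, List<String> indexedColumns) {
+    Map<String, String> values = new HashMap<>();
+    for (String column : indexedColumns) {
+      values.put(column, row.get(column));
+    }
+    secondaryKeys.put(recordKey, values);
+  }
+
+  // The metadata writer reads this back to build the SI partition entries.
+  Map<String, Map<String, String>> getSecondaryKeys() {
+    return secondaryKeys;
+  }
+}
+```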
+
+##### Updates (on the base table)
+Similar to inserts, the `secondary-key -> record-key` tuples are extracted from the `WriteStatus`. However,
+additional information needs to be emitted to aid compaction and merging (of the MDT partition hosting the SI).
+
+Let's take an example where a row's `secondary-key` column is updated, so that its index mapping changes from
+`old-secondary-key -> record-key` to `new-secondary-key -> record-key`. Note that as the secondary keys are updated,
+the `new-secondary-key` could get mapped to a different FileGroup (in the MDT partition hosting the SI) than the
+`old-secondary-key`. There needs to be a mechanism to clean up the `old-secondary-key -> record-key` mapping (from
+the MDT partition hosting the SI). The proposal is to emit a tombstone entry
+`old-secondary-key -> (record-key, deleted)`, followed by writing the `new-secondary-key -> record-key` entry. The
+log record reader (merger) and compactor should correctly use the tombstone marker to remove stale entries from the
+old FileGroup. Note that since the mapping of a record (of the MDT) to a FileGroup is based on the `secondary-key`,
+it is guaranteed that the `old-secondary-key -> (record-key, deleted)` tombstone record will get mapped to the
+FileGroup containing the `old-secondary-key -> record-key` entry.
+
+Another key observation here is that the `old-secondary-key` is required to construct the tombstone record. Unlike
+some other data systems, Hudi does not read the old image of a row on updates until a merge is executed; it detects
+that a row is getting updated by simply reading the index and appending the updates to log files. Hence, there needs
+to be a mechanism to extract the `old-secondary-key`. We propose that the `old-secondary-key` be extracted by
+scanning the MDT partition (hosting the SI) and doing a reverse lookup based on the `record-key` of the row being
+updated. It should be noted that this is going to be an expensive operation as the base table grows in size (which
+inherently means that the SI will grow in size) in terms of the number of rows. One way to optimize this is to build
+a reverse mapping `record-key -> secondary-key` in a different MDT partition. This is left as a TBD (as of this
+writing). A sketch of the overall update flow follows.
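+
+The sketch below pulls the pieces together: reverse-lookup the old secondary key, emit the tombstone, then emit the
+new mapping. The reverse index is modelled as a plain map, and all class and method names are illustrative rather
+than actual Hudi APIs.
+```java
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+class SecondaryIndexUpdateSketch {
+  static class IndexEntry {
+    final String secondaryKey;
+    final String recordKey;
+    final boolean deleted;
+    IndexEntry(String secondaryKey, String recordKey, boolean deleted) {
+      this.secondaryKey = secondaryKey;
+      this.recordKey = recordKey;
+      this.deleted = deleted;
+    }
+  }
+
+  static List<IndexEntry> onUpdate(String recordKey, String newSecondaryKey,
+                                   Map<String, String> reverseIndex) {
+    List<IndexEntry> out = new ArrayList<>();
+    // Reverse lookup of the old secondary key (expensive when done by
+    // scanning the SI partition; a dedicated reverse-mapping partition is a
+    // possible optimization, left as TBD in this RFC).
+    String oldSecondaryKey = reverseIndex.get(recordKey);
+    if (oldSecondaryKey != null && !oldSecondaryKey.equals(newSecondaryKey)) {
+      // Tombstone for the stale mapping; it maps to the same FileGroup as the
+      // old entry because FileGroup routing is based on the secondary key.
+      out.add(new IndexEntry(oldSecondaryKey, recordKey, true));
+    }
+    // Fresh mapping for the updated row.
+    out.add(new IndexEntry(newSecondaryKey, recordKey, false));
+    return out;
+  }
+}
+```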
+
+##### Deletes (on the base table)
+Deletes are handled the same way as updates, except that only the `old-secondary-key -> (record-key, deleted)` tuple
+is emitted as a tombstone record to the FileGroup containing the `old-secondary-key -> record-key` entry.
+
+### Query Execution
+The secondary index will be used to prune the candidate files (to scan) if the query has an "EQUAL TO" or "IN"
+predicate on `secondary-key` columns. The initial implementation will follow the same logic (and constraints) used by
+RLI-based data skipping (e.g., single-column keys, rule-based index selection). In the first step, we scan the SI
+partition (in the MDT) and filter all entries that match the secondary keys specified in the query predicate,
+collecting the matching record keys. Subsequently, these record keys are used to look up the record locations using
+RLI. A sketch of this two-step pruning is shown below.
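+
+The sketch below shows the two-step flow with the SI and RLI modelled as plain in-memory maps; the real
+implementation reads the corresponding MDT partitions, so the data structures and names here are simplifications.
+```java
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+class SecondaryIndexPruningSketch {
+  static Set<String> pruneFileGroups(Set<String> predicateSecondaryKeys,
+                                     Map<String, List<String>> secondaryIndex,
+                                     Map<String, String> recordLevelIndex) {
+    // Step 1: collect record keys whose secondary key matches the predicate.
+    List<String> recordKeys = new ArrayList<>();
+    for (String secondaryKey : predicateSecondaryKeys) {
+      recordKeys.addAll(secondaryIndex.getOrDefault(secondaryKey, Collections.emptyList()));
+    }
+    // Step 2: use RLI to map the record keys to the FileGroups that must be scanned.
+    Set<String> fileGroups = new HashSet<>();
+    for (String recordKey : recordKeys) {
+      String fileGroup = recordLevelIndex.get(recordKey);
+      if (fileGroup != null) {
+        fileGroups.add(fileGroup);
+      }
+    }
+    return fileGroups;
+  }
+}
+```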
+
+### Merging of secondary index records (on the read-path or during compaction)
+Record mergers are used in MOR tables by the readers to merge log files and base files. Similarity of the 'keys' in
+records is used to identify candidate records that need to be merged. The 'key' for an RLI entry is the `record-key`,
+which by definition is unique. But the keys for secondary index entries are the `secondary-keys`, which can be
+non-unique. Hence, the merging of SI entries will make use of the payload, i.e., the `record-key` in the
+`secondary-key -> record-key` tuple, to identify candidate records that need to be merged. It will also be guided by
+the tombstone records emitted during updates or deletes. An example is provided here on how the different log files
+are merged and how the merged log records are finally merged with the base file to obtain the merged records (of the
+MDT partition hosting the SI).
+
+Consider the following table, `trips_table`. Note that this table is only used to illustrate the merging logic and is
+not to be taken as a definitive table for other considerations (for example, the performance of some of the chosen
+algorithms will depend on the cardinality of the `record-key` and `secondary-key` columns).
+
+| Column Name | Column Type |
+|-------------|-------------|
+| uuid | string |
+| ts | long |
+| rider | string |
+| city | string |
+
+`uuid` is the `record-key` column and an SI is built on the `city` column. A few details to note:
+1. Base files in the MDT are sorted (HFile) by the 'key' for the records stored in that MDT partition. For the MDT
+partition hosting the SI, records are sorted by `secondary-key`.
+2. Base files will never contain tombstone records (record mergers never write tombstone or deleted records into
+base files).
+
+Base File Entries:
+```
+chennai -> c8abbe79-8d89-47ea-b4ce-4d224bae5bfa
+los-angeles -> 9909a8b1-2d15-4d3d-8ec9-efc48c536a01
+los-angeles -> 9809a8b1-2d15-4d3d-8ec9-efc48c536a01
+sfo -> 334e26e9-8355-45cc-97c6-c31daf0df330
+sfo -> 334e26e9-8355-45cc-97c6-c31daf0df329
+```
+
+Log File Entries:
+```
+sfo -> (334e26e9-8355-45cc-97c6-c31daf0df329, deleted)
+chennai -> e3cf430c-889d-4015-bc98-59bdce1e530c
+los-angeles -> (9809a8b1-2d15-4d3d-8ec9-efc48c536a01, deleted)
+austin -> 9809a8b1-2d15-4d3d-8ec9-efc48c536a01
+```
+
+Merged Records (sorted):
+```
+austin -> 9809a8b1-2d15-4d3d-8ec9-efc48c536a01
+chennai -> e3cf430c-889d-4015-bc98-59bdce1e530c
+chennai -> c8abbe79-8d89-47ea-b4ce-4d224bae5bfa
+los-angeles -> 9909a8b1-2d15-4d3d-8ec9-efc48c536a01
+sfo -> 334e26e9-8355-45cc-97c6-c31daf0df330
+```
+
+An important detail to be noted here is that grouping for merging depends on the `(secondary-key, record-key)` tuple
+(`(city, uuid)` in this example). This is radically different from any other existing partitions in Hudi (MDT or base
+tables). Hence, support needs to be added to merge records from the MDT partition hosting the SI (with a generalized
+approach that can be used for any future non-unique record key based dataset).
+
+The data structure built for holding the merged log records should enable efficient searching of similar keys, with
+the following considerations:
+1. Should allow spilling to disk (to prevent OOM). A pure in-memory data structure will not be production ready and
+can be used only for a POC.
+2. Should allow for searching (and retrieving) based on two keys as efficiently as possible. The first search will be
+based on the `secondary-key` and the second search will be based on the `record-key`. Hence, the single-level map
+used by `HoodieMergedLogRecordScanner` will not work.
+3. Should allow for efficient removal of records (deleted records need to be removed). Hence, any data structure that
+uses a flat array will not be efficient.
+4. Should allow for efficient insertion of records (for inserting merged records and for buffering fresh records).
+
+The [initial POC](https://github.com/apache/hudi/pull/10625) makes use of in-memory nested maps - with the first
+level keyed by `secondary-key` and the second level keyed by `record-key`. However, the final design should allow
+spilling to disk. A sketch of the nested-map structure and its tombstone handling is shown below.
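+
+The following is a minimal, in-memory-only sketch of that nested-map idea (so, per point 1 above, POC-grade only);
+the class and method names are illustrative. Applying the log-file entries from the example above against the
+base-file entries with this structure reproduces the merged records listed earlier.
+```java
+import java.util.HashMap;
+import java.util.Map;
+
+class MergedSecondaryIndexRecordsSketch {
+  // First level keyed by secondary-key, second level keyed by record-key.
+  private final Map<String, Map<String, Object>> records = new HashMap<>();
+
+  // A tombstone (deleted == true) removes the matching entry; otherwise the
+  // incoming record is buffered, overwriting any older entry for the same pair.
+  void apply(String secondaryKey, String recordKey, Object payload, boolean deleted) {
+    if (deleted) {
+      Map<String, Object> byRecordKey = records.get(secondaryKey);
+      if (byRecordKey != null) {
+        byRecordKey.remove(recordKey);
+        if (byRecordKey.isEmpty()) {
+          records.remove(secondaryKey);
+        }
+      }
+    } else {
+      records.computeIfAbsent(secondaryKey, k -> new HashMap<>()).put(recordKey, payload);
+    }
+  }
+}
+```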
+
+Considering the above requirements, the proposal is to introduce a new class hierarchy for handling merge keys in a
+more flexible and decoupled manner. It adds the `HoodieMergeKey` interface, along with two implementations:
+`HoodieSimpleMergeKey` and `HoodieCompositeMergeKey`.
+```java
+public interface HoodieMergeKey extends Serializable {
+
+ /**
+ * Get the partition path.
+ */
+ String getPartitionPath();
+
+ /**
+ * Get the record key.
+ */
+ Serializable getRecordKey();
+
+ /**
+ * Get the hoodie key.
+ * For simple merge keys, this is used to directly fetch the HoodieKey, which is a combination of record key and partition path.
+ */
+ default HoodieKey getHoodieKey() {
+ return new HoodieKey(getRecordKey().toString(), getPartitionPath());
+ }
+}
+```
+
+`HoodieSimpleMergeKey` simply wraps `HoodieKey` for existing scenarios where
the key is a
+string. `HoodieCompositeMergeKey` allows for complex types as keys, enhancing
flexibility for scenarios where a simple
+string key is not sufficient.
+
+```java
+public class HoodieSimpleMergeKey implements HoodieMergeKey {
+
+ private final HoodieKey simpleKey;
+
+ public HoodieSimpleMergeKey(HoodieKey simpleKey) {
+ this.simpleKey = simpleKey;
+ }
+
+ @Override
+ public String getPartitionPath() {
+ return simpleKey.getPartitionPath();
+ }
+
+ @Override
+ public Serializable getRecordKey() {
+ return simpleKey.getRecordKey();
+ }
+
+ @Override
+ public HoodieKey getHoodieKey() {
+ return simpleKey;
+ }
+}
+
+public class HoodieCompositeMergeKey<K extends Serializable> implements HoodieMergeKey {
+
+ private final K compositeKey;
+ private final String partitionPath;
+
+ public HoodieCompositeMergeKey(K compositeKey, String partitionPath) {
+ this.compositeKey = compositeKey;
+ this.partitionPath = partitionPath;
+ }
+
+ @Override
+ public String getPartitionPath() {
+ return partitionPath;
+ }
+
+ @Override
+ public Serializable getRecordKey() {
+ return compositeKey;
+ }
+}
+```
+
+We also introduce a new `HoodieRecordMerger` implementation based on `HoodieMergeKey`. For other keys, it falls back
+to the merge method of the parent class. The new record merger will be used in `HoodieMergedLogRecordScanner` to
+merge records from the MDT partition hosting the SI. A sketch of the dispatch is shown below.
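+
+The sketch below shows only the dispatch shape; apart from `HoodieMergeKey` and `HoodieCompositeMergeKey` (defined
+above), the class name, method signature, and payload type are illustrative, not the actual merger API.
+```java
+import java.util.Optional;
+
+class MergeKeyAwareMergerSketch {
+  Optional<Object> merge(HoodieMergeKey olderKey, Object older,
+                         HoodieMergeKey newerKey, Object newer, boolean newerIsTombstone) {
+    if (olderKey instanceof HoodieCompositeMergeKey) {
+      // SI path: candidates were grouped by the composite (secondary-key, record-key)
+      // tuple, so a tombstone in the newer record deletes the pair and a
+      // non-tombstone newer record simply wins.
+      return newerIsTombstone ? Optional.empty() : Optional.of(newer);
+    }
+    // Simple keys: fall back to the existing merge behaviour of the parent
+    // class (elided here; the newer record wins by default).
+    return Optional.of(newer);
+  }
+}
+```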
+
+The primary advantage of this approach is that we do not leak any details to the upper layers, such as the merge
+handle. However, `HoodieMetadataLogRecordReader` should create the `HoodieMergedLogRecordScanner` with the correct
+`HoodieRecordMerger` implementation, instead of
+the [current record merger](https://github.com/apache/hudi/blob/cb6eb6785fdeb88e66016a2b8c0c6e6fa184b309/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataLogRecordReader.java#L156).
+
+These changes do not affect existing functionality that does not rely on merge keys. They introduce additional
+classes that are used explicitly for new functionality involving various key types in merging operations. This
+ensures minimal to no risk for existing processes.
+
+### Comparing alternate design proposals
+
+Here are some alternate options that we considered:
+
+1. Extend Hudi's `ExternalSpillableMap` to support a multi-map. More significant refactoring is required, and it
+   would have leaked implementation details to the write handle layer, as the records held by `ExternalSpillableMap`
+   are exposed to the write handle via `HoodieMergedLogRecordScanner::getRecords`.
+2. Write a spillable version
+   of [Guava's multi-map](https://github.com/google/guava/wiki/NewCollectionTypesExplained#multimap). Apart from the
+   reason mentioned above, we did not want to add a third-party dependency on Guava.
+3. Use [Chronicle Map](https://github.com/OpenHFT/Chronicle-Map). Same reasons as above.
+4. Use two different spillable data structures - one a set of `secondary-key` and the other a map of
+   `record-key -> (secondary-key, HoodieRecord)`. This would have been harder to maintain, and the log scanner would
+   need to know when to use which data structure.
+
+## Test Plan
+
+1. The feature will be tested by functional tests and integration tests.
+2. An indexing strategy always requires query correctness testing. One way to achieve that is to run the query
+against the same data set twice - once using data skipping (based on the SI) and once using a full table scan - and
+compare the result sets.
+3. An indexing strategy should be accompanied by performance test results showing its benefits on the query path
+(and, optionally, the overhead on the index maintenance (write) path).
+
+## Future enhancements and roadmap
+
+The feature can evolve to provide additional functionality.
+1. As mentioned earlier, adding SI will incur costs - CPU and memory costs on the write/ingestion side and the
+overall storage cost to store the SI mapping. Tools can be built to estimate the storage cost upfront based on the
+current size of the dataset and user-specified growth in the dataset size.
+2. Tools can be built to help users choose the right columns on which to build SI (based on the observed query
+patterns). There are examples of such automated indexing advisors, both in academia and industry.
+3. Efficient data structures to store the reverse mapping (i.e., `record_key -> secondary_key`).
+4. Ability to build SI on composite keys.
+5. Cost-based index selection (during query execution).