[hudi] branch master updated: [HUDI-53] Update RFC-8 for Metadata based Record Index (#9775)

yihua Thu, 28 Sep 2023 14:24:15 -0700

This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/master by this push:
     new af7caa9382b [HUDI-53] Update RFC-8 for Metadata based Record Index 
(#9775)
af7caa9382b is described below

commit af7caa9382bb7918234d0df8116b6fddf5135b47
Author: Prashant Wason <[email protected]>
AuthorDate: Thu Sep 28 14:24:04 2023 -0700

    [HUDI-53] Update RFC-8 for Metadata based Record Index (#9775)
---
 rfc/README.md                       |   2 +-
 rfc/rfc-8/metadata_record_index.jpg | Bin 0 -> 42413 bytes
 rfc/rfc-8/rfc-8.md                  | 219 ++++++++++++++++++++++++++++++++++++
 3 files changed, 220 insertions(+), 1 deletion(-)

diff --git a/rfc/README.md b/rfc/README.md
index 073ae68d480..1afb3ae61ac 100644
--- a/rfc/README.md
+++ b/rfc/README.md
@@ -43,7 +43,7 @@ The list of all RFCs can be found here.
 | 5          | [HUI (Hudi 
WebUI)](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=130027233)
                                                                                
                                   | `ABANDONED`    |
 | 6          | [Add indexing support to the log 
file](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+06+%3A+Add+indexing+support+to+the+log+file)
                                                                       | 
`ABANDONED`    |
 | 7          | [Point in time Time-Travel queries on Hudi 
table](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+07+%3A+Point+in+time+Time-Travel+queries+on+Hudi+table)
                                                 | `COMPLETED`    |
-| 8          | [Record level indexing mechanisms for Hudi 
datasets](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets)
                                                | `ONGOING`      |
+| 8          | [Metadata based Record Index](./rfc-8/rfc-8.md)                 
                                                                                
                                                                 | `COMPLETED`  
  |
 | 9          | [Hudi Dataset Snapshot 
Exporter](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter)
                                                                                
   | `COMPLETED`    |
 | 10         | [Restructuring and auto-generation of 
docs](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+10+%3A+Restructuring+and+auto-generation+of+docs)
                                                             | `COMPLETED`    |
 | 11         | [Refactor of the configuration framework of hudi 
project](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+11+%3A+Refactor+of+the+configuration+framework+of+hudi+project)
                                 | `ABANDONED`    |
diff --git a/rfc/rfc-8/metadata_record_index.jpg 
b/rfc/rfc-8/metadata_record_index.jpg
new file mode 100644
index 00000000000..52083e81728
Binary files /dev/null and b/rfc/rfc-8/metadata_record_index.jpg differ
diff --git a/rfc/rfc-8/rfc-8.md b/rfc/rfc-8/rfc-8.md
new file mode 100644
index 00000000000..4fe06139018
--- /dev/null
+++ b/rfc/rfc-8/rfc-8.md
@@ -0,0 +1,219 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-8: Metadata based Record Index
+
+## Proposers
+- @prashantwason
+
+## Approvers
+
+
+## Status
+JIRA: https://issues.apache.org/jira/browse/HUDI-53
+
+
+## Abstract
+HUDI requires an [Index](https://hudi.apache.org/docs/indexing) during updates 
to locate the existing records by their 
+unique record keys. The HUDI Index is a mapping of the record-key to record's 
file path. Hudi supports several indexes 
+like:
+ 1. Bloom Index: Employs bloom filters built out of the record keys, 
optionally also pruning candidate files using record key ranges.
+ 2. Simple Index (default): Performs a lean join of the incoming update/delete 
records against keys extracted from the table on storage.
+ 3. HBase Index: Manages the index mapping in an external Apache HBase table.
+
+We are proposing a new Index called Record Index which will save the record 
key to file path location within the 
+[HUDI Metadata Table](https://hudi.apache.org/docs/metadata). Since the HUDI 
Metadata Table is internal to a HUDI Dataset, 
+the Record Index is updated and queried using the resources already available 
to the HUDI dataset.
+
+
+## Justification
+
+Bloom and Simple Index are slow for large datasets as they have high costs 
involved in gathering the index data from various
+data files at lookup time. Furthermore, these indexes do not save a one-to-one 
record-key to record file path mapping but
+deduce the mapping via an optimized search at lookup time. A per file overhead 
required in these indexes means that datasets 
+with larger number of files or number of records will not work well with these 
indexes. 
+
+The Hbase Index saves one to one mapping for each record key so is very fast 
and scaled with the dataset size. But Hbase 
+Index requires a separate HBase cluster to be maintained. HBase is 
operationally difficult to maintain and scale for throughput, 
+requires dedicated resources and expertise to maintain.
+
+The Record Index will provide the speed and scalability of HBase Index without 
all the limitation and overhead. Since 
+the HUDI Metadata Table is a HUDI Table, all future performance improvements 
in writes and queries will automatically 
+provide those improvements to Record Index performance. 
+
+## Design
+Record Index will save the record-key to file path mapping in a new partition 
within the HUDI Metadata Table. Metadata table
+uses HBase HFile - the tree map file format to store and retrieve data. HFile 
is an indexed file format
+and supports map like faster lookups by keys. Since, we will be storing 
mapping for every single record key, Record Index
+lookups for large number of keys transform into direct lookups of keys from 
HUDI Metadata Table and should be able to 
+benefit greatly from the faster lookups in HFile.
+
+
+### Metadata Table partitioning and schema:
+
+A new partition `record_index` will be added under the metadata table. The 
existing metadata table payload schema will
+be extended and shared for this partition also. The type field will be used to 
detect the record_index payload record.
+Here is the schema for the record_index payload record.
+```
+    {
+        "name": "recordIndexMetadata",
+        "doc": "Metadata Index that contains information about record keys and 
their location in the dataset",
+        "type": [
+            "null",
+             {
+               "type": "record",
+               "name": "HoodieRecordIndexInfo",
+                "fields": [
+                    {
+                        "name": "partition",
+                        "type": "string",
+                        "doc": "Partition which contains the record",
+                        "avro.java.string": "String"
+                    },
+                    {
+                        "name": "fileIdHighBits",
+                        "type": "long",
+                        "doc": "fileId which contains the record (high 64 
bits)"
+                    },
+                    {
+                        "name": "fileIdLowBits",
+                        "type": "long",
+                        "doc": "fileId which contains the record (low 64 bits)"
+                    },
+                    {
+                        "name": "fileIndex",
+                        "type": "int",
+                        "doc": "index of the file"
+                    },
+                    {
+                        "name": "instantTime",
+                        "type": "long",
+                        "doc": "Epoch time in millisecond at which record was 
added"
+                    }
+                ]
+            }
+        ],
+        "default" : null
+    }
+```
+
+The key for the record index record would be the actual key from the record. 
The partition name is also saved as string.
+HUDI base files names have a format which includes a UUID fileID, an integer 
file Index, a write token and a timestamp. 
+The record index payload only saves the fileID and file index information. The 
fileID is split into UUID and the integer file index. The UUID is encoded into 
two longs and the file index is saved
+as an integer. The timestamp is encoded into epoch time in milliseconds.
+
+This schema format is chosen to minimize the data size of each mapping to 
ensure the smallest possible size of the 
+record index even for datasets with billions of records. 
+
+Experiments have shown that with random UUID record keys and datestr 
partitions (YYYY/MM/DD), we can achieve an average
+size of 50 to 55 bytes per mapping saved in the record index. The size might 
even be lower for keys which may compress better.
+
+Below picture gives a pictorial representation of record index partition in 
metadata table.
+<img src="metadata_record_index.jpg" alt="Record Index Partition" width="800"/>
+
+
+### Record Index initialization:
+
+Like any other HUDI Metadata Table index, the record index can be initialized 
inline (before the writer writes records to the dataset)
+or via the Async Indexer.
+
+The initialization involves the following steps:
+1. Get the list of all files in the dataset 
+   1. Since the `files` partition is a pre-requisite for all other partitions 
in Metadata Table, the list of all files can be taken from the Metadata Table 
itself and does not involve listing the entire dataset.
+2. Read the record keys from all the files in the dataset
+   1. Only the record key column needs to be read from the base files. 
+   2. This step scales with more Executors and more memory
+3. Determine the number of fileGroups to use for the `record index` partition
+4. Create record index records corresponding to each record key read
+5. Insert the records into the Metadata Table partition `record index`
+
+We will add functionality to automatically estimate the number of fileGroups 
to use for the `record index` partition based
+on the number of records in the dataset (available after Step 2 above). This 
should simplify rollout as the user does not
+have to worry about the number of fileGroups for optimal performance. Configs 
will allow specifying the number of fileGroups
+too.
+
+
+### Metadata Index lookup:
+
+For the incoming upsert records, given their keys, tag their current location. 
The key lookup would require the following steps:
+
+1. Generate the list of keys to be looked up (extract HoodieKeys from the 
upsert records)
+2. Lookup all the keys from the HUDI Metadata Table
+   1. Keys are partitioned based on the hash as the HUDI Metadata Table 
mappings are saved in various fileGroups (count fixed at initialization time) 
with each fileGroup saving a portion of the key space
+   2. Each partition of keys is looked up in parallel from the fileGroup using 
various Executors
+3. Tag the location, where a mapping was found in HUDI Metadata Table, back to 
upsert records 
+
+Given N fileGroups in the record index, an indexing lookup of M keys is 
reduced to N lookups of M/N keys in parallel. Hence, 
+for fastest lookup operation, the number of executors for the writer process 
should be >= N.
+
+This also means that lookup from record index can be scaled with growing data 
size by:
+1. Increasing the number of fileGroups (N in the above example)
+2. Using at least N or greater executors for the indexing process
+
+HDFS based experiments have shown than on average key lookups from HFile in 
HUDI Metadata Table complete in 1-2msec. 
+So for lookup of M keys we expect ballpark time of K + M / N * 2msec where K 
is the overhead of opening HFile (~100msec) 
+and merging the log files. Periodic compaction of Metadata Table keeps the 
value of K lower.
+
+
+## Implementation
+1. No changes to the HoodieIndex public interface.
+2. A new index type will be added - RECORD_LEVEL
+
+
+### Writer flow:
+Let's walk through the writer flow to update the record index.
+
+Whenever a new commit is getting applied to metadata table, we do the 
following.<br>
+1. Parse the WriteStatus to determine the record which have been inserted into 
the dataset
+   1. Such records have new location (HoodieRecord::getNewLocation()) but no 
current location ((HoodieRecord::getCurrentLocation()))
+2. Create new records for each record keys being added to the dataset
+3. Commit all these records to metadata table.
+
+We need to ensure that WriteStatus tracks all written records keys for every 
commit.
+
+
+### Reader flow:
+When a new batch of write is ingested into Hudi, we need to tag the records 
with their
+original file group location. Refer to Metadata Index lookup section for more 
details.
+
+
+### Limitations:
+1. The number of file groups are fixed at the time of initialization and there 
is no support for dynamically increasing or decreasing the number of file 
groups.
+2. If the total number of records in the dataset grows by large factor, the 
number of file groups might need to be increased to maintain the same 
performance.
+   1. This currently requires re-initialization of the record index.
+3. Record Index is a global index and hence requires unique keys in the dataset
+
+
+### Future Improvements:
+1. Add support for non-global index
+2. Add support for indexing only a window of days rather than the entire 
dataset.
+   1. This will allow the record index to be used efficiently for datasets 
where dedupe is required on the last N days of data.
+3. Add support for dynamically increasing or decreasing the number of file 
groups.
+
+
+## Rollout/Adoption Plan
+* Record Index will be available in 0.14.0 release
+* Metadata Table scheme will be upgraded as part of release upgrade process
+* Record Index will be disabled by default and can be enabled by setting the 
write configs 
+
+
+## Test Plan
+* Functionality
+    * Tag location for existing keys
+    * Tag location for non-existing keys
+* Performance
+    * Prove Metadata based indices are helping upsert use cases
+* Upgrade

[hudi] branch master updated: [HUDI-53] Update RFC-8 for Metadata based Record Index (#9775)

Reply via email to