Sagar Sumit created HUDI-7145:
---------------------------------
Summary: Support for grouping values for same key in HFile
Key: HUDI-7145
URL: https://issues.apache.org/jira/browse/HUDI-7145
Project: Apache Hudi
Issue Type: Task
Reporter: Sagar Sumit
Fix For: 1.0.0
Hudi writes metadata table (MT) base files in HFile format. HFile stores sorted
key-value pairs. For the existing MT partitions, each key is guaranteed to be
unique. For a secondary index, however, the same value of the secondary index
field is very likely to appear in multiple files, so keys are no longer unique.
This ticket is to microbenchmark two approaches to storing the secondary index:
# Group all values for a key and store key-value pairs in which each value is a
collection. For example, say column c1 is the secondary index column, with
value v1 in files f1 and f2 and value v2 in file f2. This approach still yields
just 2 keys: i) v1: [f1, f2] and ii) v2: [f2].
# Since each key-value pair is unique as a whole, store each key-value pair
separately (still lexicographically sorted). In this approach, there are 3
entries in the HFile: i) v1: f1, ii) v1: f2 and iii) v2: f2.
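The two layouts can be illustrated with a minimal in-memory sketch. It uses plain sorted Python lists in place of an HFile, and the function names (build_grouped, build_flat, lookup_grouped, lookup_flat) are hypothetical, not Hudi APIs. It only shows how the c1 example above maps onto each approach and how a point lookup would differ.

```python
from bisect import bisect_left
from collections import defaultdict

# (secondary key value, file) observations from the c1 example above.
observations = [("v1", "f1"), ("v1", "f2"), ("v2", "f2")]

def build_grouped(pairs):
    """Approach 1: one entry per key; the value is a sorted collection of files."""
    grouped = defaultdict(set)
    for key, file_id in pairs:
        grouped[key].add(file_id)
    return sorted((key, sorted(files)) for key, files in grouped.items())

def build_flat(pairs):
    """Approach 2: one entry per (key, file) pair, sorted lexicographically."""
    return sorted(set(pairs))

def lookup_grouped(entries, key):
    """Binary search for the key, then return its collection in one step."""
    i = bisect_left(entries, (key,))
    if i < len(entries) and entries[i][0] == key:
        return entries[i][1]
    return []

def lookup_flat(entries, key):
    """Binary search for the first entry with the key, then scan forward."""
    i = bisect_left(entries, (key,))
    files = []
    while i < len(entries) and entries[i][0] == key:
        files.append(entries[i][1])
        i += 1
    return files

grouped_entries = build_grouped(observations)
flat_entries = build_flat(observations)
print(grouped_entries)
print(flat_entries)
```

Both lookups return the same files for a key; the grouped layout reads one entry, while the flat layout seeks to the first match and scans adjacent entries.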
The benchmark should capture the storage overhead and lookup latency of each
approach relative to the other.
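A rough shape for such a microbenchmark is sketched below, under loud assumptions: the data is synthetic, the key and file counts are arbitrary, "storage" is just the sum of raw key and value byte lengths, and lookups run against in-memory sorted lists. None of this models HFile block layout, encoding, or compression, so the numbers only indicate what to measure, not what a real HFile would show. All function names are hypothetical.

```python
import random
import time
from bisect import bisect_left
from itertools import groupby

def make_pairs(num_keys, files_per_key, seed=0):
    """Synthetic, sorted (value, file) pairs; sizes are arbitrary assumptions."""
    rng = random.Random(seed)
    pairs = []
    for k in range(num_keys):
        for f in rng.sample(range(1000), files_per_key):
            pairs.append((f"v{k:06d}", f"f{f:04d}"))
    return sorted(pairs)

def to_grouped(pairs):
    """Collapse sorted flat pairs into one (key, [files]) entry per key."""
    return [(key, [f for _, f in grp])
            for key, grp in groupby(pairs, key=lambda p: p[0])]

def raw_bytes_flat(pairs):
    """Flat layout repeats the key once per file."""
    return sum(len(k) + len(f) for k, f in pairs)

def raw_bytes_grouped(grouped):
    """Grouped layout stores each key once, followed by all its files."""
    return sum(len(k) + sum(len(f) for f in files) for k, files in grouped)

def lookup_flat(pairs, key):
    i = bisect_left(pairs, (key,))
    out = []
    while i < len(pairs) and pairs[i][0] == key:
        out.append(pairs[i][1])
        i += 1
    return out

def lookup_grouped(grouped, key):
    i = bisect_left(grouped, (key,))
    return grouped[i][1] if i < len(grouped) and grouped[i][0] == key else []

def time_lookups(fn, entries, keys):
    """Average wall-clock seconds per lookup over the probe keys."""
    start = time.perf_counter()
    for key in keys:
        fn(entries, key)
    return (time.perf_counter() - start) / len(keys)

pairs = make_pairs(num_keys=2000, files_per_key=5)
grouped = to_grouped(pairs)
probe_keys = [f"v{k:06d}" for k in random.Random(1).sample(range(2000), 500)]

print("raw bytes  flat:", raw_bytes_flat(pairs),
      " grouped:", raw_bytes_grouped(grouped))
print("sec/lookup flat:", time_lookups(lookup_flat, pairs, probe_keys),
      " grouped:", time_lookups(lookup_grouped, grouped, probe_keys))
```

A real benchmark would write both layouts through the HFile writer and read them back through the HFile reader, and would vary the number of files per key, since that ratio drives how much key repetition the flat layout pays for.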