Sagar Sumit created HUDI-7145:
---------------------------------

             Summary: Support for grouping values for same key in HFile
                 Key: HUDI-7145
                 URL: https://issues.apache.org/jira/browse/HUDI-7145
             Project: Apache Hudi
          Issue Type: Task
            Reporter: Sagar Sumit
             Fix For: 1.0.0


Hudi writes metadata table (MT) base files in HFile format. HFile stores sorted 
key-value pairs. For the existing MT partitions, the key is guaranteed to be 
unique. For a secondary index, however, the same value of the secondary index 
field is very likely to appear in multiple files, so keys can repeat.

This ticket is to microbenchmark two approaches to storing the secondary index:
 # Group all values for a key and store key-value pairs in which each value 
is a collection. For example, say column c1 is the secondary index column, 
with value v1 in files f1 and f2, and value v2 in file f2. With this 
approach there are still just 2 keys: i) v1: [f1, f2] and ii) v2: [f2].
 # Since each key-value pair is unique as a whole, store each key-value pair 
separately (still lexicographically sorted). With this approach, there are 3 
entries in the HFile: i) v1: f1, ii) v1: f2 and iii) v2: f2.
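
As an illustration only, the two layouts can be sketched in Python using the 
c1 example above. This is not Hudi or HFile code: the function names 
(grouped_layout, flat_layout, lookup_grouped, lookup_flat) are hypothetical, 
and a sorted in-memory list stands in for the sorted HFile entries.

```python
from bisect import bisect_left
from collections import defaultdict

# Sample data from the ticket: (secondary index value, file) pairs.
pairs = [("v1", "f1"), ("v1", "f2"), ("v2", "f2")]


def grouped_layout(pairs):
    """Approach 1: one entry per key; the value is a sorted list of files."""
    groups = defaultdict(list)
    for key, file_id in pairs:
        groups[key].append(file_id)
    return sorted((key, sorted(files)) for key, files in groups.items())


def flat_layout(pairs):
    """Approach 2: one entry per (key, file) pair, sorted lexicographically."""
    return sorted(set(pairs))


def lookup_grouped(entries, key):
    """Binary-search to the single entry for `key` and return its file list."""
    # (key,) sorts immediately before any (key, value) tuple.
    i = bisect_left(entries, (key,))
    if i < len(entries) and entries[i][0] == key:
        return entries[i][1]
    return []


def lookup_flat(entries, key):
    """Binary-search to the first entry for `key`, then scan forward."""
    i = bisect_left(entries, (key,))
    files = []
    while i < len(entries) and entries[i][0] == key:
        files.append(entries[i][1])
        i += 1
    return files


grouped = grouped_layout(pairs)  # 2 entries
flat = flat_layout(pairs)        # 3 entries
```

Both lookups return the same files for a key. The difference is that the 
grouped layout reads one entry, while the flat layout seeks to the first 
matching entry and scans the adjacent ones.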

The benchmark should capture the storage overhead and lookup latency of each 
approach relative to the other.
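
A rough sketch of the shape such a microbenchmark could take, in Python. 
Everything here is an assumption made for illustration: the synthetic data 
generator, the length-prefixed record encoding used to estimate size, and the 
in-memory sorted lists. It does not model real HFile blocks, indexes, or 
compression, so the real benchmark would need to run against actual HFiles.

```python
import random
import time
from bisect import bisect_left
from itertools import groupby


def make_pairs(num_keys, files_per_key, seed=0):
    """Synthetic sorted (key, file) pairs; each key appears in several files."""
    rng = random.Random(seed)
    files = [f"f{i:05d}" for i in range(1000)]
    pairs = []
    for k in range(num_keys):
        key = f"v{k:08d}"
        for file_id in rng.sample(files, files_per_key):
            pairs.append((key, file_id))
    return sorted(pairs)


def record_size(key, value):
    """Assumed encoding: 4-byte length prefix for the key and for the value."""
    return 4 + len(key.encode()) + 4 + len(value.encode())


def lookup_grouped(entries, key):
    i = bisect_left(entries, (key,))
    if i < len(entries) and entries[i][0] == key:
        return entries[i][1]
    return []


def lookup_flat(entries, key):
    i = bisect_left(entries, (key,))
    files = []
    while i < len(entries) and entries[i][0] == key:
        files.append(entries[i][1])
        i += 1
    return files


def mean_latency(lookup, entries, keys):
    """Average wall-clock seconds per lookup over the given keys."""
    start = time.perf_counter()
    for key in keys:
        lookup(entries, key)
    return (time.perf_counter() - start) / len(keys)


flat = make_pairs(num_keys=200, files_per_key=4)
# Approach 1 entries: collapse consecutive pairs that share a key.
grouped = [(k, [f for _, f in grp]) for k, grp in groupby(flat, key=lambda p: p[0])]

# Storage estimate: grouped stores each key once, flat repeats it per file.
grouped_bytes = sum(record_size(k, ",".join(fs)) for k, fs in grouped)
flat_bytes = sum(record_size(k, f) for k, f in flat)

probe_keys = [k for k, _ in grouped]
grouped_latency = mean_latency(lookup_grouped, grouped, probe_keys)
flat_latency = mean_latency(lookup_flat, flat, probe_keys)

print(f"grouped: {grouped_bytes} bytes, {grouped_latency * 1e6:.2f} us/lookup")
print(f"flat:    {flat_bytes} bytes, {flat_latency * 1e6:.2f} us/lookup")
```

Under this assumed encoding, the flat layout is larger because the key is 
repeated for every file. The latency numbers from an in-memory list say little 
about HFile reads and are included only to show where the measurement would go.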

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
