[
https://issues.apache.org/jira/browse/HUDI-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sagar Sumit closed HUDI-7145.
-----------------------------
Resolution: Done
> Support for grouping values for same key in HFile
> -------------------------------------------------
>
> Key: HUDI-7145
> URL: https://issues.apache.org/jira/browse/HUDI-7145
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Sagar Sumit
> Assignee: Sagar Sumit
> Priority: Major
> Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Hudi writes metadata table (MT) base files in HFile format. HFile stores
> sorted key-value pairs. For the existing MT partitions, the key is guaranteed
> to be unique. However, for a secondary index, it is very likely that the same
> value of the secondary index field appears in multiple files.
> This ticket is to microbenchmark two approaches to storing the secondary index:
> # Group all values for a key and then store key-value pairs where each value
> in the pair is a collection. For example, say column c1 is the secondary
> index column, with value v1 in files f1, f2 and value v2 in file f2. With this
> approach there are still just 2 keys, as follows: i) v1: [f1, f2] and ii)
> v2: [f2].
> # Since each key-value pair is unique as a whole, store each key-value
> pair separately (still lexicographically sorted). In this approach, we
> have 3 entries in the HFile: i) v1: f1, ii) v1: f2 and iii) v2: f2.
> The benchmark should capture the storage overhead and lookup latency of each
> approach relative to the other.
>
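The two layouts described in the ticket can be sketched in a few lines of
Python. This is only an in-memory illustration that uses sorted lists in place
of an HFile; it does not use Hudi or HBase APIs, and the function names
(build_grouped, build_flat, lookup_grouped, lookup_flat) and the PAIRS constant
are hypothetical names introduced for this example.

```python
from bisect import bisect_left
from collections import defaultdict

# Example data from the ticket: (secondary index value, file) pairs.
PAIRS = [("v1", "f1"), ("v1", "f2"), ("v2", "f2")]


def build_grouped(pairs):
    """Approach 1: one sorted entry per key; the value is a collection."""
    groups = defaultdict(set)
    for key, file_id in pairs:
        groups[key].add(file_id)
    return [(key, sorted(groups[key])) for key in sorted(groups)]


def build_flat(pairs):
    """Approach 2: one sorted entry per unique (key, file) pair."""
    return sorted(set(pairs))


def lookup_grouped(entries, key):
    """Point lookup: binary search for the key, return its collection."""
    keys = [k for k, _ in entries]  # rebuilt per call to keep the sketch short
    i = bisect_left(keys, key)
    if i < len(entries) and entries[i][0] == key:
        return entries[i][1]
    return []


def lookup_flat(entries, key):
    """Prefix scan: seek to the first entry for the key, then read forward."""
    # The 1-tuple (key,) sorts before every (key, file) tuple with that key.
    i = bisect_left(entries, (key,))
    files = []
    while i < len(entries) and entries[i][0] == key:
        files.append(entries[i][1])
        i += 1
    return files
```

In this sketch the grouped layout answers a lookup with a single seek, while
the flat layout needs a seek followed by a short forward scan. The flat layout
repeats the key in every entry, which is the kind of storage overhead the
benchmark is meant to measure against real HFiles.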
--
This message was sent by Atlassian Jira
(v8.20.10#820010)