[jira] [Closed] (HUDI-7145) Support for grouping values for same key in HFile

Sagar Sumit (Jira) Wed, 03 Apr 2024 18:46:04 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sagar Sumit closed HUDI-7145.
-----------------------------
    Resolution: Done

> Support for grouping values for same key in HFile
> -------------------------------------------------
>
>                 Key: HUDI-7145
>                 URL: https://issues.apache.org/jira/browse/HUDI-7145
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Assignee: Sagar Sumit
>            Priority: Major
>              Labels: hudi-1.0.0-beta2
>             Fix For: 1.0.0
>
>
> Hudi writes metadata table (MT) base files in HFile format. HFile stores 
> sorted key-value pairs. For the existing MT partitions, the key is guaranteed 
> to be unique. However, for a secondary index, it is very likely that the same 
> value of the secondary index field appears in multiple files.
> This ticket is to microbenchmark two approaches to storing the secondary index:
>  # Group all values for a key and then store key-value pairs where each value 
> in the pair is a collection. For example, say column c1 is the secondary 
> index column, with value v1 in files f1 and f2 and value v2 in file f2. Then 
> this approach yields just 2 keys, as follows: i) v1: [f1, f2] and ii) 
> v2: [f2].
>  # Since each key-value pair is unique as a whole, store each key-value 
> pair separately (still lexicographically sorted). In this approach, we 
> have 3 entries in the HFile: i) v1: f1, ii) v1: f2 and iii) v2: f2.
> The benchmark should capture storage overhead and lookup latency of one 
> approach over the other.
>  
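
For illustration only, the two layouts described above can be sketched in plain Python using the c1 example (v1 in f1 and f2, v2 in f2). This is not Hudi or HFile code: the function names, the in-memory sorted lists, and the crude size estimate are all assumptions made for the sketch, and a real benchmark would have to measure actual HFile blocks and readers.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical observations from the ticket's example:
# (secondary index value, file that contains it).
observations = [("v1", "f1"), ("v1", "f2"), ("v2", "f2")]


def grouped_layout(pairs):
    """Approach 1: one sorted entry per key; the value is a collection of files."""
    groups = defaultdict(set)
    for key, file_id in pairs:
        groups[key].add(file_id)
    return [(key, sorted(files)) for key, files in sorted(groups.items())]


def flat_layout(pairs):
    """Approach 2: one sorted entry per unique (key, file) pair."""
    return sorted(set(pairs))


def lookup_grouped(entries, key):
    """Binary search for the single entry holding all files for the key."""
    keys = [k for k, _ in entries]
    i = bisect_left(keys, key)
    if i < len(entries) and entries[i][0] == key:
        return entries[i][1]
    return []


def lookup_flat(entries, key):
    """Seek to the first entry for the key, then scan while the key matches."""
    i = bisect_left(entries, (key, ""))
    files = []
    while i < len(entries) and entries[i][0] == key:
        files.append(entries[i][1])
        i += 1
    return files


def approx_size(entries):
    """Crude character count of keys plus values; ignores real HFile overhead."""
    total = 0
    for key, value in entries:
        values = value if isinstance(value, list) else [value]
        total += len(key) + sum(len(v) for v in values)
    return total


grouped = grouped_layout(observations)
flat = flat_layout(observations)
```

In this toy model both lookups return the same files for a key, while the flat layout repeats the key once per file. That repetition is the storage overhead the benchmark is meant to quantify, alongside the cost of the extra scan in the flat lookup.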



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
