[ https://issues.apache.org/jira/browse/HUDI-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6701:
---------------------------------
    Description: 
Today, we auto-generate string keys of the form shown below
(HoodieRecord#generateSequenceId), which are highly compressible, especially
compared to UUIDv1, when stored as a string column inside a Parquet file.

{code:java}
  public static String generateSequenceId(String instantTime, int partitionId, long recordIndex) {
    return instantTime + "_" + partitionId + "_" + recordIndex;
  }
{code}

As a part of this task, we'd love to understand:

- Can UUIDv6 or UUIDv7 provide a similar compressed storage footprint when
written as a column in a Parquet file?
- Can the current format be represented as a 160-bit number, i.e. 2 longs and
1 int, in storage? Would that save us further on storage costs?

(An orthogonal consideration is the memory needed to hold the key string, which
can be higher than 160 bits. We can discuss this later, once we understand the
storage footprint.)
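
One rough way to get a first signal on the UUID question, before writing real
Parquet files, is to compress a batch of each key type and compare sizes. The
sketch below is only an illustration: the class and helper names
(KeyCompressionSketch, sequenceId, uuidV7, deflatedSize) are hypothetical, the
UUIDv7 generator is hand-rolled from the layout in the rfc4122bis draft (48-bit
Unix millis, version, 12 random bits, variant, 62 random bits), and
java.util.zip.Deflater is used as a stand-in for a Parquet codec. It ignores
Parquet's own encodings (dictionary, delta, etc.), so the numbers are
indicative at best.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.zip.Deflater;

// Hypothetical helper class, not part of Hudi.
class KeyCompressionSketch {

  // Same shape as HoodieRecord#generateSequenceId in the description.
  static String sequenceId(String instantTime, int partitionId, long recordIndex) {
    return instantTime + "_" + partitionId + "_" + recordIndex;
  }

  // Hand-rolled UUIDv7 (assumption: layout as in the rfc4122bis draft).
  static UUID uuidV7(long unixMillis, SecureRandom rnd) {
    // 48-bit timestamp | 4-bit version (7) | 12 random bits
    long msb = (unixMillis << 16) | (0x7L << 12) | (rnd.nextLong() & 0x0FFFL);
    // 2-bit variant (binary 10) | 62 random bits
    long lsb = (rnd.nextLong() & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
    return new UUID(msb, lsb);
  }

  // Deflates the newline-joined values and returns the compressed size in bytes.
  static int deflatedSize(Iterable<String> values) {
    StringBuilder sb = new StringBuilder();
    for (String v : values) {
      sb.append(v).append('\n');
    }
    byte[] input = sb.toString().getBytes(StandardCharsets.UTF_8);
    Deflater deflater = new Deflater();
    deflater.setInput(input);
    deflater.finish();
    byte[] buf = new byte[8192];
    int total = 0;
    while (!deflater.finished()) {
      total += deflater.deflate(buf);
    }
    deflater.end();
    return total;
  }

  public static void main(String[] args) {
    int n = 100_000;
    SecureRandom rnd = new SecureRandom();
    long now = System.currentTimeMillis();
    List<String> seq = new ArrayList<>(n);
    List<String> v7 = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      // Example instant time and partition id; both values are made up.
      seq.add(sequenceId("20230815123045123", 7, i));
      v7.add(uuidV7(now, rnd).toString());
    }
    System.out.println("sequence ids: " + deflatedSize(seq) + " bytes");
    System.out.println("uuidv7:       " + deflatedSize(v7) + " bytes");
  }
}
```

Note that each UUIDv7 built this way carries 74 random bits, so no lossless
codec can get below roughly 9 bytes per key on average, whereas consecutive
sequence ids differ only in their trailing digits. A proper answer still needs
measurements on actual Parquet files with the codecs Hudi uses.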
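
To make the 160-bit idea concrete, here is a minimal sketch of packing the
three components into 20 bytes (8 + 4 + 8). It assumes the instant time is an
all-digit timestamp string that fits in a long and has no leading zeros;
otherwise the round trip back to the string form would not be exact.
PackedKeySketch, pack and unpack are hypothetical names, not existing Hudi
APIs.

```java
import java.nio.ByteBuffer;

// Hypothetical helper class, not part of Hudi.
class PackedKeySketch {

  // 2 longs + 1 int = 8 + 8 + 4 bytes = 20 bytes = 160 bits.
  static final int PACKED_BYTES = Long.BYTES + Integer.BYTES + Long.BYTES;

  // Packs the key components in big-endian order (ByteBuffer's default).
  // Assumption: instantTime is numeric and fits in a long.
  static byte[] pack(String instantTime, int partitionId, long recordIndex) {
    return ByteBuffer.allocate(PACKED_BYTES)
        .putLong(Long.parseLong(instantTime))
        .putInt(partitionId)
        .putLong(recordIndex)
        .array();
  }

  // Rebuilds the string form produced by generateSequenceId.
  static String unpack(byte[] packed) {
    ByteBuffer buf = ByteBuffer.wrap(packed);
    long instant = buf.getLong();
    int partitionId = buf.getInt();
    long recordIndex = buf.getLong();
    return instant + "_" + partitionId + "_" + recordIndex;
  }

  public static void main(String[] args) {
    // Example values; the instant time and ids are made up.
    String key = "20230815123045123" + "_" + 7 + "_" + 42L;
    byte[] packed = pack("20230815123045123", 7, 42L);
    System.out.println("string key chars: " + key.length());
    System.out.println("packed key bytes: " + packed.length);
    System.out.println("round trip ok:    " + key.equals(unpack(packed)));
  }
}
```

Whether the fixed 20-byte form actually beats the string column after Parquet
encoding and compression is exactly what this task would need to measure; the
sketch only shows that the representation is feasible under the stated
assumption.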
 
Resources:
* https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/09/ 
* https://github.com/uuid6/uuid6-ietf-draft
* https://github.com/uuid6/prototypes 



  was:
Today, we auto generate string keys of the form, which is highly compressible, 
esp compared to uuidv1, when we store as a string column inside a parquet file.

{code:java}
  public static String generateSequenceId(String instantTime, int partitionId, 
long recordIndex) {
    return instantTime + "_" + partitionId + "_" + recordIndex;
  }
{code}

As a part of this task, we'd love to understand if 

- Can uuid6 or 7, provide similar compressed storage footprint when written as 
a column in a parquet file. 
- can the current format be represented as a 160-bit number i.e 2 longs, 1 int 
in storage? would that save us further in storage costs?  

(Orthogonal consideration is the memory needed to hold the key string, which 
can be higher than a 160bits. We can discuss this later, once we understand 
storage footprint) 
 
Resources:
* https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/09/ 
* https://github.com/uuid6/uuid6-ietf-draft
* https://github.com/uuid6/prototypes 




> Explore use of UUID-6/7 as a replacement for current auto generated keys
> ------------------------------------------------------------------------
>
>                 Key: HUDI-6701
>                 URL: https://issues.apache.org/jira/browse/HUDI-6701
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Vinoth Chandar
>            Assignee: Lin Liu
>            Priority: Major
>             Fix For: 1.0.0
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
