[
https://issues.apache.org/jira/browse/HUDI-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757001#comment-17757001
]
Lin Liu commented on HUDI-6701:
-------------------------------
Will discuss the next step in today's sync up.
> Explore use of UUID-6/7 as a replacement for current auto generated keys
> ------------------------------------------------------------------------
>
> Key: HUDI-6701
> URL: https://issues.apache.org/jira/browse/HUDI-6701
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Vinoth Chandar
> Assignee: Lin Liu
> Priority: Major
> Fix For: 1.0.0
>
>
> Today, we auto-generate string keys of the form shown below
> (HoodieRecord#generateSequenceId). These keys are highly compressible, especially
> compared to UUIDv1, when stored as a string column inside a Parquet file.
> {code:java}
> public static String generateSequenceId(String instantTime, int partitionId, long recordIndex) {
>   return instantTime + "_" + partitionId + "_" + recordIndex;
> }
> {code}
> As a part of this task, we'd love to understand:
> - Can UUIDv6 or UUIDv7 provide a similar compressed storage footprint when written
> as a column in a Parquet file?
> - Can the current format be represented as a 160-bit number, i.e. 2 longs and 1
> int, in storage? Would that save us further in storage costs?
> (An orthogonal consideration is the memory needed to hold the key string, which
> can be higher than 160 bits. We can discuss this later, once we understand the
> storage footprint.)
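To make the 160-bit question concrete, here is a minimal sketch of packing the three key components into 20 bytes (8 + 4 + 8). It assumes the instant time is purely numeric and fits in a long, which is an assumption and not something the issue states. PackedKeySketch, pack, and unpack are hypothetical names for illustration, not existing Hudi APIs.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not part of Hudi.
class PackedKeySketch {
    // 2 longs + 1 int = 8 + 4 + 8 = 20 bytes = 160 bits.
    static final int PACKED_SIZE = Long.BYTES + Integer.BYTES + Long.BYTES;

    // Pack the three components into a fixed-width, big-endian byte array.
    static byte[] pack(long instantTime, int partitionId, long recordIndex) {
        return ByteBuffer.allocate(PACKED_SIZE)
                .putLong(instantTime)
                .putInt(partitionId)
                .putLong(recordIndex)
                .array();
    }

    // Rebuild the string form produced by generateSequenceId.
    // Assumption: the original instant time had no leading zeros.
    static String unpack(byte[] packed) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        long instantTime = buf.getLong();
        int partitionId = buf.getInt();
        long recordIndex = buf.getLong();
        return instantTime + "_" + partitionId + "_" + recordIndex;
    }

    public static void main(String[] args) {
        String key = "20230821103045123_7_42"; // made-up example key
        String[] parts = key.split("_");
        byte[] packed = pack(Long.parseLong(parts[0]),
                Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
        System.out.println("string chars: " + key.length());
        System.out.println("packed bytes: " + packed.length);
        System.out.println("round trip ok: " + key.equals(unpack(packed)));
    }
}
```

This only compares raw, uncompressed sizes. Whether the packed form is smaller after Parquet encoding and compression would still need to be measured.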
>
> Resources:
> * https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/09/
> * https://github.com/uuid6/uuid6-ietf-draft
> * https://github.com/uuid6/prototypes
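For reference while evaluating UUIDv7, here is a rough sketch of its bit layout as described in the draft linked above: a 48-bit Unix millisecond timestamp, a 4-bit version, 12 random bits, a 2-bit variant, and 62 random bits. UuidV7Sketch and uuidV7 are hypothetical names; this is not a vetted generator and it omits the optional monotonic counter. Keys generated close together in time share the leading timestamp bits, but the random tail does not repeat, so how the compressed footprint compares is exactly what the task needs to measure.

```java
import java.security.SecureRandom;
import java.util.UUID;

// Hypothetical illustration only; not part of Hudi.
class UuidV7Sketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    static UUID uuidV7(long unixMillis) {
        // Most significant 64 bits: 48-bit timestamp | version 7 | 12 random bits.
        long randA = RANDOM.nextInt(1 << 12);
        long msb = ((unixMillis & 0xFFFFFFFFFFFFL) << 16) | 0x7000L | randA;

        // Least significant 64 bits: variant bits "10" | 62 random bits.
        long randB = RANDOM.nextLong() & 0x3FFFFFFFFFFFFFFFL;
        long lsb = 0x8000000000000000L | randB;

        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Two keys from the same millisecond share the first 12 hex digits.
        System.out.println(uuidV7(now));
        System.out.println(uuidV7(now));
    }
}
```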