[ https://issues.apache.org/jira/browse/HUDI-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757867#comment-17757867 ]
Lin Liu commented on HUDI-6701:
-------------------------------
Based on our discussion and experiments, we have decided to keep the current
row key format, since it strikes the best balance between time and storage
costs. This task is closed for now.
> Explore use of UUID-6/7 as a replacement for current auto generated keys
> ------------------------------------------------------------------------
>
> Key: HUDI-6701
> URL: https://issues.apache.org/jira/browse/HUDI-6701
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Vinoth Chandar
> Assignee: Lin Liu
> Priority: Major
> Fix For: 1.0.0
>
>
> Today, we auto-generate string keys of the form below
> (HoodieRecord#generateSequenceId), which are highly compressible, especially
> compared to UUIDv1, when stored as a string column inside a Parquet file.
> {code:java}
> public static String generateSequenceId(String instantTime, int partitionId, long recordIndex) {
>   return instantTime + "_" + partitionId + "_" + recordIndex;
> }
> {code}
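A rough way to sanity-check the compressibility claim above is sketched here. It is only an illustration under stated assumptions, not the experiment run for this ticket: java.util.zip.Deflater stands in for Parquet's encodings and codecs, UUID.randomUUID() (version 4) stands in for a UUID column because the JDK ships no v1/v6/v7 generator, and the instant time value is made up. Time-ordered UUIDs share a timestamp prefix, so they would likely compress better than the random ones used here.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import java.util.zip.Deflater;

public class KeyCompressionSketch {

    // Same format as the generateSequenceId method quoted in the issue.
    public static String generateSequenceId(String instantTime, int partitionId, long recordIndex) {
        return instantTime + "_" + partitionId + "_" + recordIndex;
    }

    // Deflate the input and return the compressed size in bytes.
    public static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // n sequence ids joined by newlines: a crude stand-in for a string column.
    // The instant time and partition id are hypothetical values.
    public static byte[] sequenceIdColumn(int n) {
        StringBuilder sb = new StringBuilder();
        for (long i = 0; i < n; i++) {
            sb.append(generateSequenceId("20230823101530123", 0, i)).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // n random (version 4) UUID strings joined by newlines.
    public static byte[] randomUuidColumn(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(UUID.randomUUID()).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int n = 100_000;
        System.out.println("sequence ids, deflated bytes: " + deflatedSize(sequenceIdColumn(n)));
        System.out.println("random UUIDs, deflated bytes: " + deflatedSize(randomUuidColumn(n)));
    }
}
```

Real numbers would need to come from writing actual Parquet files, since dictionary and delta encodings behave differently from plain Deflate.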
> As a part of this task, we'd love to understand:
> - Can UUIDv6 or UUIDv7 provide a similar compressed storage footprint when
> written as a column in a Parquet file?
> - Can the current format be represented as a 160-bit number, i.e. 2 longs and
> 1 int, in storage? Would that save us further in storage costs?
> (An orthogonal consideration is the memory needed to hold the key string, which
> can be higher than 160 bits. We can discuss this later, once we understand the
> storage footprint.)
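To make the 160-bit question concrete, here is one possible packing of the three key components into 20 bytes. This is a sketch under assumptions, not anything Hudi does: the class name PackedSequenceId and the field order are hypothetical, and it assumes the instant time is a purely numeric string that fits in a signed long and has no leading zeros.

```java
import java.nio.ByteBuffer;

public class PackedSequenceId {

    // 2 longs + 1 int = 8 + 8 + 4 bytes = 20 bytes = 160 bits.
    public static final int SIZE_BYTES = 2 * Long.BYTES + Integer.BYTES;

    // Pack (instantTime, partitionId, recordIndex) into a fixed-width byte array.
    // Assumes instantTime contains only decimal digits and fits in a long.
    public static byte[] pack(String instantTime, int partitionId, long recordIndex) {
        return ByteBuffer.allocate(SIZE_BYTES)
                .putLong(Long.parseLong(instantTime))
                .putInt(partitionId)
                .putLong(recordIndex)
                .array();
    }

    // Rebuild the string form instantTime_partitionId_recordIndex from the packed bytes.
    public static String unpack(byte[] packed) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        long instantTime = buf.getLong();
        int partitionId = buf.getInt();
        long recordIndex = buf.getLong();
        return instantTime + "_" + partitionId + "_" + recordIndex;
    }

    public static void main(String[] args) {
        byte[] packed = pack("20230823101530123", 5, 42L);
        System.out.println(packed.length + " bytes -> " + unpack(packed));
    }
}
```

Whether the fixed-width form is actually smaller on disk than the compressed string column is exactly the open question of this task, so it would have to be measured rather than assumed.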
>
> Resources:
> * https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/09/
> * https://github.com/uuid6/uuid6-ietf-draft
> * https://github.com/uuid6/prototypes