Vinoth Chandar created HUDI-6701:
------------------------------------
Summary: Explore use of UUID-6/7 as a replacement for current auto
generated keys
Key: HUDI-6701
URL: https://issues.apache.org/jira/browse/HUDI-6701
Project: Apache Hudi
Issue Type: Improvement
Reporter: Vinoth Chandar
Assignee: Lin Liu
Fix For: 1.0.0
Today, we auto-generate string keys of the form shown below. This format is
highly compressible, especially compared to UUIDv1, when stored as a string
column inside a Parquet file.
{code:java}
public static String generateSequenceId(String instantTime, int partitionId,
                                        long recordIndex) {
  return instantTime + "_" + partitionId + "_" + recordIndex;
}
{code}
As part of this task, we'd love to understand:
- Can UUIDv6 or UUIDv7 provide a similar compressed storage footprint when
written as a column in a Parquet file?
- Can the current format be represented as a 160-bit number, i.e. 2 longs and
1 int, in storage? Would that reduce storage costs further?
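To make the second question concrete, here is a minimal sketch of what a 160-bit (20-byte) representation could look like. The class PackedKey and its methods are hypothetical and not part of Hudi. The sketch assumes instantTime is purely numeric, has no leading zeros, and fits in a long (for example a 17-digit yyyyMMddHHmmssSSS value).

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not an existing Hudi class.
final class PackedKey {

    // 8 bytes (instantTime) + 4 bytes (partitionId) + 8 bytes (recordIndex) = 20 bytes = 160 bits.
    static final int PACKED_SIZE = Long.BYTES + Integer.BYTES + Long.BYTES;

    // Packs the three key components into a fixed-width, big-endian byte array.
    // Assumption: instantTime parses as a long; non-numeric instants would throw NumberFormatException.
    static byte[] pack(String instantTime, int partitionId, long recordIndex) {
        return ByteBuffer.allocate(PACKED_SIZE)
            .putLong(Long.parseLong(instantTime))
            .putInt(partitionId)
            .putLong(recordIndex)
            .array();
    }

    // Rebuilds the current string form from the packed bytes.
    static String unpack(byte[] packed) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        long instantTime = buf.getLong();
        int partitionId = buf.getInt();
        long recordIndex = buf.getLong();
        return instantTime + "_" + partitionId + "_" + recordIndex;
    }
}
```

Whether a fixed-width binary column like this actually beats the compressed string form in Parquet is the open question of this task and would need to be measured.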
(An orthogonal consideration is the memory needed to hold the key string, which
can be larger than 160 bits. We can discuss this later, once we understand the
storage footprint.)
Resources:
* https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/09/
* https://github.com/uuid6/uuid6-ietf-draft
* https://github.com/uuid6/prototypes
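
For reference while experimenting, the UUIDv7 bit layout described in the draft above can be sketched as follows: a 48-bit Unix millisecond timestamp, a 4-bit version, 12 random bits, a 2-bit variant, and 62 random bits. The class UuidV7Sketch is hypothetical and hand-rolled for illustration. It is not a Hudi API and not a vetted generator (for instance, it makes no attempt at monotonicity within a single millisecond).

```java
import java.security.SecureRandom;
import java.util.UUID;

// Hypothetical illustration of the UUIDv7 layout; not production code.
final class UuidV7Sketch {

    private static final SecureRandom RANDOM = new SecureRandom();

    // Builds a UUID whose most significant 48 bits are the given Unix timestamp in milliseconds.
    static UUID fromTimestamp(long unixMillis) {
        long msb = (unixMillis & 0xFFFFFFFFFFFFL) << 16; // 48-bit timestamp in the top bits
        msb |= 0x7000L;                                  // version field = 7
        msb |= RANDOM.nextInt(1 << 12);                  // 12 random bits (rand_a)

        long lsb = RANDOM.nextLong();
        lsb = (lsb & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L; // variant bits = 10, then 62 random bits

        return new UUID(msb, lsb);
    }

    // Convenience overload using the current wall-clock time.
    static UUID next() {
        return fromTimestamp(System.currentTimeMillis());
    }
}
```

Keys generated close together in time share their leading timestamp bits, which is the part a columnar encoding could plausibly exploit. The 74 random bits per key are unlikely to compress, so the footprint comparison against the current format still needs to be measured.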