[
https://issues.apache.org/jira/browse/HUDI-9541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Davis Zhang updated HUDI-9541:
------------------------------
Description:
*{color:#ff0000}For reviewers to properly sign off, please comment with
"approve solution X" in the comment section.{color}*
h1. What's the issue
The function below assumes the input has the form <sec key><separator><record key> and extracts the
<sec key> part.
{code:java}
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  // the payload key is in the format of "secondaryKey$primaryKey"
  // we need to extract the secondary key from the payload key
  checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key format for secondary index payload: " + key);
  int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
  return unescapeSpecialChars(key.substring(0, delimiterIndex));
}
{code}
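The separator is "$".
To use a character like "$" as a separator safely, we need to escape it in the keys when encoding and unescape it when decoding.
Otherwise, for a corner case like
<sec key> = 0$1
the function returns "0" as the secondary key, which is wrong.
But which escape/unescape scheme should we use?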
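The corner case can be reproduced with a minimal sketch. The class and method names here are hypothetical stand-ins for the unescaped "secondaryKey$primaryKey" layout, not the actual Hudi code.

```java
// Hypothetical stand-in for an unescaped "secondaryKey$primaryKey" layout.
final class NaiveSeparatorDemo {
    static final String SEPARATOR = "$";

    // Joins the two keys with no escaping at all.
    static String naiveEncode(String secondaryKey, String recordKey) {
        return secondaryKey + SEPARATOR + recordKey;
    }

    // Treats everything before the first separator as the secondary key.
    static String naiveSecondaryKey(String key) {
        return key.substring(0, key.indexOf(SEPARATOR));
    }

    public static void main(String[] args) {
        String encoded = naiveEncode("0$1", "pk");
        System.out.println(encoded);                    // 0$1$pk
        System.out.println(naiveSecondaryKey(encoded)); // 0
    }
}
```

The secondary key "0$1" comes back as "0" because the first "$" inside the key is mistaken for the separator.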
h1. Impact
Perf, maintainability
h1. Proposal
h2. Sol 1: Escaping the string (current implementation)
h3. 🔧 Escape Rules:
||Element||Encoding||
|{{$}} in key|{{"$d"}}|
|Delimiter|{{"$s"}}|
For a null key, just as we make $ special, we pick ASCII 0 as the special
character representing a null string and escape ASCII 0 wherever it occurs in a normal string.
See https://issues.apache.org/jira/browse/HUDI-9543
h3. 🛠 Encoding
{{encodedKey = escape(secondaryKey) + "$s" + escape(recordKey)}}
h3. 🛠 Decoding
* Find the *first occurrence* of {{"$s"}} (guaranteed to be the delimiter).
* Substring before → unescape to get secondary key.
* Substring after → unescape to get record key.
----
h2. ✅ *Functions*
{code:java}
// Escape "$" as "$d"
public static String escape(String key) {
  return key.replace("$", "$d");
}

// Reverse "$d" back to "$"
public static String unescape(String key) {
  return key.replace("$d", "$");
}

// Encode full key
public static String encodeSecondaryIndexKey(String secKey, String recordKey) {
  return escape(secKey) + "$s" + escape(recordKey);
}

// Decode secondary key
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  int idx = key.indexOf("$s");
  return unescape(key.substring(0, idx));
}

// Decode primary key
public static String getRecordKeyFromSecondaryIndexKey(String key) {
  int idx = key.indexOf("$s");
  return unescape(key.substring(idx + 2));
}
{code}
----
h2. ✅ *Examples*
h3. Example 1
{{secKey = "$"}}
{{recordKey = "$"}}
*Encoded:* {{"$d$s$d"}}
*Decoded:*
* secondaryKey = {{"$"}}
* recordKey = {{"$"}}
h3. Example 2
{{secKey = "ab$cd"}}
{{recordKey = "e$f"}}
*Encoded:* {{"ab$dcd$se$df"}}
*Decoded:*
* secondaryKey = {{"ab$cd"}}
* recordKey = {{"e$f"}}
h3. Example 3
{{secKey = "$s"}}
{{recordKey = "$s"}}
*Encoded:* {{"$ds$s$ds"}}
*Decoded:*
* secondaryKey = {{"$s"}}
* recordKey = {{"$s"}}
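The three examples above can be checked with a small round-trip sketch. The class and method names are hypothetical, not the Hudi API. The guard for a missing delimiter is an added assumption that mirrors the checkState call in the current code.

```java
// Hypothetical codec for the "$d" / "$s" scheme; names are illustrative only.
final class SecondaryIndexKeyCodec {
    private static final String ESCAPED_DOLLAR = "$d";
    private static final String DELIMITER = "$s";

    static String escape(String key) {
        return key.replace("$", ESCAPED_DOLLAR);
    }

    static String unescape(String key) {
        return key.replace(ESCAPED_DOLLAR, "$");
    }

    static String encode(String secKey, String recordKey) {
        return escape(secKey) + DELIMITER + escape(recordKey);
    }

    // After escaping, every "$" inside a key is followed by "d",
    // so the first "$s" can only be the delimiter.
    private static int delimiterIndex(String key) {
        int idx = key.indexOf(DELIMITER);
        if (idx < 0) {
            throw new IllegalArgumentException("Invalid key format for secondary index payload: " + key);
        }
        return idx;
    }

    static String secondaryKey(String key) {
        return unescape(key.substring(0, delimiterIndex(key)));
    }

    static String recordKey(String key) {
        return unescape(key.substring(delimiterIndex(key) + DELIMITER.length()));
    }

    public static void main(String[] args) {
        String[][] examples = {{"$", "$"}, {"ab$cd", "e$f"}, {"$s", "$s"}};
        for (String[] e : examples) {
            String encoded = encode(e[0], e[1]);
            System.out.println(encoded + " -> " + secondaryKey(encoded) + " , " + recordKey(encoded));
        }
    }
}
```

Each example round-trips to its original pair, including Example 3, where both keys contain the literal text "$s".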
----
h3. ✔ Pros:
* {*}Minimal overhead{*}: Adds at most 1 extra character per {{$}} in the key.
* {*}Preserves human readability{*}: You can still identify the structure when
viewing raw keys in logs or S3.
* {*}No external dependencies{*}: Just string processing.
h3. ❌ Cons:
* {*}Custom escaping logic required{*}: More complicated logic to identify the
_first unescaped_ {{$}} reliably.
* {*}Risk of future maintenance bugs{*}: It's easy to get escaping wrong
(e.g., forgetting to unescape or double-escaping).
* {*}Cannot use naive {{split()}}{*}: Must implement a stateful parser to find
the first _non-escaped_ delimiter.
h2. Sol 1.2: Base64-based encoding
# Run Base64 encoding: {{Base64.encode(secondaryKey) + DELIMITER + Base64.encode(recordKey)}}. The Base64 alphabet does not contain $, so this gives us a neat
and standard way to encode. It might not be very efficient for long strings, but
Base64 is a standard scheme.
# Escape special characters: {{escapeSpecialChars(secondaryKey) + DELIMITER + recordKey}}. The keys are readable and preserve their order. This is a custom
scheme not used in other systems.
For a null key, we can use ASCII 0 as a special-case encoding.
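A minimal sketch of the Base64 option, assuming UTF-8 key bytes, the standard {{java.util.Base64}} encoder, and "$" as DELIMITER. The class and method names are hypothetical. Null handling is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical Base64-based codec; not the Hudi implementation.
final class Base64SecondaryIndexKey {
    static final String DELIMITER = "$";

    private static String b64(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    private static String unb64(String s) {
        return new String(Base64.getDecoder().decode(s), StandardCharsets.UTF_8);
    }

    static String encode(String secondaryKey, String recordKey) {
        return b64(secondaryKey) + DELIMITER + b64(recordKey);
    }

    // Returns {secondaryKey, recordKey}. Base64 output never contains "$",
    // so the first "$" is always the delimiter.
    static String[] decode(String key) {
        int idx = key.indexOf(DELIMITER);
        if (idx < 0) {
            throw new IllegalArgumentException("Invalid key format for secondary index payload: " + key);
        }
        return new String[] {
            unb64(key.substring(0, idx)),
            unb64(key.substring(idx + DELIMITER.length()))
        };
    }

    public static void main(String[] args) {
        String encoded = encode("0$1", "pk");
        String[] decoded = decode(encoded);
        System.out.println(encoded + " -> " + decoded[0] + " , " + decoded[1]);
    }
}
```

The sketch uses {{indexOf}} rather than {{split()}} because {{String.split}} takes a regular expression, in which a bare "$" is an end-of-input anchor. A {{split()}} variant would need {{Pattern.quote("$")}}.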
h3. ✔ Pros:
* {*}No ambiguity or delimiter conflicts{*}: Base64 never uses {{{}${}}}, so
the separator is guaranteed safe.
* {*}Standard, well-tested scheme{*}: Easy to encode/decode using libraries.
* {*}Simple parsing{*}: Can safely use {{split()}} without risk.
* {*}Robust{*}: Handles any kind of string input, including control characters.
h3. ❌ Cons:
* {*}Longer keys{*}: Base64 encoding increases size by ~33%.
* {*}Less readable{*}: The keys are opaque.
* {*}S3 implications{*}: Longer object keys may slightly impact performance or
cost (e.g., S3 LIST operations), though this is usually minor.
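To make the size trade-off concrete, the per-key length under each scheme can be estimated as below. This is a rough sketch with hypothetical names. It assumes padded standard Base64 over UTF-8 bytes and counts only a single key, excluding the delimiter.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper comparing encoded key lengths under the two schemes.
final class KeyOverheadDemo {
    // "$d" escaping adds one character per "$" in the key.
    static int escapedLength(String key) {
        int dollars = 0;
        for (int i = 0; i < key.length(); i++) {
            if (key.charAt(i) == '$') {
                dollars++;
            }
        }
        return key.length() + dollars;
    }

    // Padded Base64 emits 4 characters for every 3 input bytes, rounded up.
    static int base64Length(String key) {
        int n = key.getBytes(StandardCharsets.UTF_8).length;
        return 4 * ((n + 2) / 3);
    }

    public static void main(String[] args) {
        String key = "ab$cd";
        System.out.println(escapedLength(key)); // 6
        System.out.println(base64Length(key));  // 8
    }
}
```

For ASCII keys the two numbers are directly comparable. For non-ASCII keys, {{escapedLength}} counts characters while {{base64Length}} is derived from bytes.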
> Secondary index bug
> -------------------
>
> Key: HUDI-9541
> URL: https://issues.apache.org/jira/browse/HUDI-9541
> Project: Apache Hudi
> Issue Type: Bug
> Components: index
> Affects Versions: 1.1.0
> Reporter: Davis Zhang
> Assignee: Davis Zhang
> Priority: Critical
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)