[jira] [Updated] (HUDI-9541) Secondary index bug

Davis Zhang (Jira) Tue, 24 Jun 2025 16:36:15 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-9541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Davis Zhang updated HUDI-9541:
------------------------------
    Description: 
 
*{color:#ff0000}For reviewers to properly sign off, please comment with 
"approve solution X" in the comment section.{color}*

 
h1. What's the issue

The function below assumes the input has the form 
<sec key><separator><record key> and extracts the <sec key> part.

 
{code:java}
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  // the payload key is in the format of "secondaryKey$primaryKey"
  // we need to extract the secondary key from the payload key
  checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR),
      "Invalid key format for secondary index payload: " + key);
  int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
  return unescapeSpecialChars(key.substring(0, delimiterIndex));
} {code}
The separator is "$".

 

To safely treat a character like "$" as special, we need to escape and 
unescape the keys.

Otherwise, for a corner case like

<sec key> = 0$1

the function returns "0" as the sec key, which is wrong.
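
A minimal sketch of this corner case, assuming a naive extractor that cuts at the first "$" with no escaping. The class and method names (NaiveSplitDemo, naiveEncode, naiveSecondaryKey) are hypothetical and are not Hudi APIs; they only illustrate why the unescaped format is ambiguous.

```java
// Hypothetical illustration; not the actual Hudi implementation.
final class NaiveSplitDemo {
  static final String SEPARATOR = "$";

  // Builds the key without any escaping: secondaryKey + "$" + recordKey.
  static String naiveEncode(String secondaryKey, String recordKey) {
    return secondaryKey + SEPARATOR + recordKey;
  }

  // Cuts at the first "$", which is only correct when the secondary key has no "$".
  static String naiveSecondaryKey(String key) {
    return key.substring(0, key.indexOf(SEPARATOR));
  }

  public static void main(String[] args) {
    String encoded = naiveEncode("0$1", "pk");
    System.out.println(encoded);                    // 0$1$pk
    System.out.println(naiveSecondaryKey(encoded)); // 0
  }
}
```

The encoded form "0$1$pk" could equally have come from the pair ("0", "1$pk"), so no decoder can recover the original pair without an escaping scheme.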

 

But which escape/unescape scheme should we use?
h1. Impact

Perf, maintainability

 
h1. Proposal
h2. Sol1: Escaping the string (current implementation)

 
h3. 🔧 Escape Rules:
||Element||Encoding||
|{{$}} in key|{{"$d"}}|
|Delimiter|{{"$s"}}|

 

For a null key, just as we make $ special, we pick ASCII 0 as the special 
character representing the null string and escape ASCII 0 in normal strings.

https://issues.apache.org/jira/browse/HUDI-9543
h3. 🛠 Encoding

{{encodedKey = escape(secondaryKey) + "$s" + escape(recordKey)}}
h3. 🛠 Decoding
 * Find the *first occurrence* of {{"$s"}} (guaranteed to be the delimiter).

 * Substring before → unescape to get secondary key.

 * Substring after → unescape to get record key.

----
h2. ✅ *Functions*
{code:java}
// Escape "$" as "$d"
public static String escape(String key) {
  return key.replace("$", "$d");
}
// Reverse "$d" back to "$"
public static String unescape(String key) {
  return key.replace("$d", "$");
}
// Encode full key
public static String encodeSecondaryIndexKey(String secKey, String recordKey) {
  return escape(secKey) + "$s" + escape(recordKey);
}
// Decode secondary key
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  int idx = key.indexOf("$s");
  return unescape(key.substring(0, idx));
}
// Decode primary key
public static String getRecordKeyFromSecondaryIndexKey(String key) {
  int idx = key.indexOf("$s");
  return unescape(key.substring(idx + 2));
} {code}
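
The decoders above assume the encoded key always contains "$s". A hedged sketch of a defensive variant follows. The class name SafeSecondaryIndexKey, the combined decode helper, and the choice of IllegalArgumentException are assumptions made for illustration, not part of the proposal.

```java
// Hypothetical defensive variant of the decoders; names are illustrative only.
final class SafeSecondaryIndexKey {
  static final String DELIMITER = "$s";

  // Escape "$" as "$d", as in the proposal.
  static String escape(String key) {
    return key.replace("$", "$d");
  }

  // Reverse "$d" back to "$".
  static String unescape(String key) {
    return key.replace("$d", "$");
  }

  static String encode(String secKey, String recordKey) {
    return escape(secKey) + DELIMITER + escape(recordKey);
  }

  // Returns {secondaryKey, recordKey}. Rejects input that lacks the delimiter
  // instead of failing with an index error or returning a wrong substring.
  static String[] decode(String key) {
    int idx = key.indexOf(DELIMITER);
    if (idx < 0) {
      throw new IllegalArgumentException(
          "Invalid key format for secondary index payload: " + key);
    }
    return new String[] {
        unescape(key.substring(0, idx)),
        unescape(key.substring(idx + DELIMITER.length()))
    };
  }
}
```

Without the check, a key with no "$s" makes indexOf return -1, so the secondary-key decoder throws on substring(0, -1) while the record-key decoder silently returns the input minus its first character.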
----
h2. ✅ *Examples*
h3. Example 1

{{secKey = "$"}}

{{recordKey = "$"}}
*Encoded:* {{"$d$s$d"}}
*Decoded:*
 * secondaryKey = {{"$"}}

 * recordKey = {{"$"}}

h3. Example 2

{{secKey = "ab$cd"}}

{{recordKey = "e$f"}}
*Encoded:* {{"ab$dcd$se$df"}}
*Decoded:*
 * secondaryKey = {{"ab$cd"}}

 * recordKey = {{"e$f"}}

h3. Example 3

{{secKey = "$s"}}

{{recordKey = "$s"}}
*Encoded:* {{"$ds$s$ds"}}
*Decoded:*
 * secondaryKey = {{"$s"}}

 * recordKey = {{"$s"}}
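
The three examples can be checked mechanically. The sketch below re-declares the escape helpers so it is self-contained; the class name EscapeExamples is hypothetical.

```java
// Hypothetical check of the three examples; names are illustrative only.
final class EscapeExamples {
  static String escape(String key) {
    return key.replace("$", "$d");
  }

  static String unescape(String key) {
    return key.replace("$d", "$");
  }

  static String encode(String secKey, String recordKey) {
    return escape(secKey) + "$s" + escape(recordKey);
  }

  public static void main(String[] args) {
    String[][] pairs = { {"$", "$"}, {"ab$cd", "e$f"}, {"$s", "$s"} };
    for (String[] p : pairs) {
      String encoded = encode(p[0], p[1]);
      // After escaping, every "$" inside a key is followed by "d",
      // so the first "$s" can only be the delimiter.
      int idx = encoded.indexOf("$s");
      String sec = unescape(encoded.substring(0, idx));
      String rec = unescape(encoded.substring(idx + 2));
      System.out.println(encoded + " -> " + sec + ", " + rec);
    }
  }
}
```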

----
h3. ✔ Pros:
 * {*}Minimal overhead{*}: Adds at most 1 extra character per {{$}} in the key.
 * {*}Preserves human readability{*}: You can still identify the structure when 
viewing raw keys in logs or S3.
 * {*}No external dependencies{*}: Just string processing.

h3. ❌ Cons:
 * {*}Custom escaping logic required{*}: More complicated logic to identify the 
_first unescaped_ {{$}} reliably.

 * {*}Risk of future maintenance bugs{*}: It's easy to get escaping wrong 
(e.g., forgetting to unescape or double-escaping).

 * {*}Cannot use naive {{split()}}{*}: Must implement a stateful parser to find 
the first _non-escaped_ delimiter.

h2. Sol 1.2: Base64-based encoding
 # Run Base64 encoding: 
`Base64.encode(secondaryKey) + DELIMITER + Base64.encode(recordKey)`. 
The Base64 alphabet does not contain $, so this gives us a neat and standard 
way to encode. It might be less efficient for long strings, but Base64 is a 
standard scheme.
 # Escape special characters: 
`escapeSpecialChars(secondaryKey) + DELIMITER + recordKey`. 
The keys are readable and preserve the order. This is a custom scheme not 
used in other systems.
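
A minimal sketch of the Base64 option, assuming java.util.Base64 with the standard (padded) alphabet and UTF-8 bytes. The class name Base64SecondaryIndexKey and its helper methods are hypothetical and not taken from the Hudi code base.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch of the Base64 option; not the actual Hudi implementation.
final class Base64SecondaryIndexKey {
  static final String DELIMITER = "$";

  static String toBase64(String s) {
    return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
  }

  static String fromBase64(String s) {
    return new String(Base64.getDecoder().decode(s), StandardCharsets.UTF_8);
  }

  static String encode(String secondaryKey, String recordKey) {
    return toBase64(secondaryKey) + DELIMITER + toBase64(recordKey);
  }

  // The standard Base64 alphabet (A-Z, a-z, 0-9, '+', '/', '=') has no '$',
  // so the only '$' in the encoded key is the delimiter.
  static String[] decode(String key) {
    int idx = key.indexOf(DELIMITER);
    if (idx < 0) {
      throw new IllegalArgumentException("Missing delimiter in key: " + key);
    }
    return new String[] {
        fromBase64(key.substring(0, idx)),
        fromBase64(key.substring(idx + DELIMITER.length()))
    };
  }
}
```

The sketch uses indexOf rather than split because Java's String.split takes a regular expression, in which a bare "$" is an end-of-input anchor; a literal match would need Pattern.quote("$").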

 

For a null key, we can use ASCII 0 as a special encoding rule.
h3. ✔ Pros:
 * {*}No ambiguity or delimiter conflicts{*}: Base64 never uses {{$}}, so 
the separator is guaranteed safe.

 * {*}Standard, well-tested scheme{*}: Easy to encode/decode using libraries.

 * {*}Simple parsing{*}: Can safely use {{split()}} without risk.

 * {*}Robust{*}: Handles any kind of string input, including control characters.

h3. ❌ Cons:
 * {*}Longer keys{*}: Base64 encoding increases size by ~33%.
 * {*}Less readable{*}: The keys are opaque.
 * {*}S3 implications{*}: Longer object keys may slightly impact performance or 
cost (e.g., S3 LIST operations), though this is usually minor.
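
To make the size trade-off concrete, here is a small sketch comparing the two encodings on a sample key. The class name KeySizeComparison and the sample key are made up for illustration. Padded Base64 produces 4 * ceil(n / 3) characters for n input bytes, while the "$d" escaping adds one character per "$".

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical size comparison; the sample key is invented for illustration.
final class KeySizeComparison {
  // Length after the "$" -> "$d" escaping: one extra character per "$".
  static int escapedLength(String key) {
    return key.replace("$", "$d").length();
  }

  // Length after padded Base64 over the key's UTF-8 bytes.
  static int base64Length(String key) {
    return Base64.getEncoder()
        .encodeToString(key.getBytes(StandardCharsets.UTF_8))
        .length();
  }

  public static void main(String[] args) {
    String key = "customer_id_000000000123456789"; // 30 ASCII characters, no "$"
    System.out.println(escapedLength(key)); // 30
    System.out.println(base64Length(key));  // 40
  }
}
```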


> Secondary index bug
> -------------------
>
>                 Key: HUDI-9541
>                 URL: https://issues.apache.org/jira/browse/HUDI-9541
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>    Affects Versions: 1.1.0
>            Reporter: Davis Zhang
>            Assignee: Davis Zhang
>            Priority: Critical



