[jira] [Updated] (HUDI-9541) Secondary index bug

Davis Zhang (Jira) Tue, 24 Jun 2025 18:51:10 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-9541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Davis Zhang updated HUDI-9541:
------------------------------
    Description: 
 
*{color:#ff0000}For reviewers to properly sign off, please comment with 
"approve solution X" in the comment section.{color}*

 
h1. What's the issue

Here we assume the input is <sec key><separator><record key> and extracts the 
<sec key> part.

 
{code:java}
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  // the payload key is in the format of "secondaryKey$primaryKey"
  // we need to extract the secondary key from the payload key
  checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key 
format for secondary index payload: " + key);
  int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
  return unescapeSpecialChars(key.substring(0, delimiterIndex));
} {code}
The separator is "$".

 

To make char like "$" to be special safely, we need to escape and unescape the 
keys.

Otherwise for corner case like

<sec key> = 0$1

the function returns the "0" as sec key which is apparently wrong.

 

But what escape/unescape scheme we should use?
h1. Impact

Perf, maintainability

 
h1. Proposal
h2. Sol1: Escaping the string (current implmenetation)

 
h3. 🔧 Escape Rules:
||Element||Encoding||
|{{$}} in key|{{"$d"}}|
|Delimiter|{{"$s"}}|

 

For null key, just like how we make $ special, we pick ASCII 0 as the special 
char representing null str and escape ASCII 0 in normal string.

https://issues.apache.org/jira/browse/HUDI-9543
h3. 🛠 Encoding

{{encodedKey = escape(secondaryKey) + "$s" + escape(recordKey)}}
h3. 🛠 Decoding
 * Find the *first occurrence* of {{"$s"}} (guaranteed to be the delimiter).

 * Substring before → unescape to get secondary key.

 * Substring after → unescape to get record key.

----
h2. ✅ *Functions*
{code:java}
// Escape "$" as "$d"
public static String escape(String key) {
  return key.replace("$", "$d");
}
// Reverse "$d" back to "$"
public static String unescape(String key) {
  return key.replace("$d", "$");
}
// Encode full key
public static String encodeSecondaryIndexKey(String secKey, String recordKey) {
  return escape(secKey) + "$s" + escape(recordKey);
}
// Decode secondary key
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  int idx = key.indexOf("$s");
  return unescape(key.substring(0, idx));
}
// Decode primary key
public static String getRecordKeyFromSecondaryIndexKey(String key) {
  int idx = key.indexOf("$s");
  return unescape(key.substring(idx + 2));
} {code}
----
h2. ✅ *Examples*
h3. Example 1

{{secKey = "$"}}

{{recordKey = "$"}}
*Encoded:* {{"$d$s$d"}}
*Decoded:*
 * secondaryKey = {{"$"}}

 * recordKey = {{"$"}}

h3. Example 2

{{secKey = "ab$cd"}}

{{recordKey = "e$f"}}
*Encoded:* {{"ab$dcd$se$df"}}
*Decoded:*
 * secondaryKey = {{"ab$cd"}}

 * recordKey = {{"e$f"}}

h3. Example 3

{{secKey = "$s"}}

{{recordKey = "$s"}}
*Encoded:* {{""$ds$s$ds"}}
*Decoded:*
 * secondaryKey = {{"$s"}}

 * recordKey = {{"$s"}}

----
h3. ✔ Pros:
 * {*}Minimal overhead{*}: Adds at most 1 extra character per {{$}} in the key.
 * {*}Preserves human readability{*}: You can still identify the structure when 
viewing raw keys in logs or S3.
 * {*}No external dependencies{*}: Just string processing.

h3. ❌ Cons:
 * {*}Custom escaping logic required{*}: More complicated logic to identify the 
_first unescaped_ {{$}} reliably.

 * {*}Risk of future maintenance bugs{*}: It's easy to get escaping wrong 
(e.g., forgetting to unescape or double-escaping).

 * {*}Cannot use naive {{split()}}{*}: Must implement a stateful parser to find 
the first _non-escaped_ delimiter.

h2. Sol 1.2 base 64 based encoding
 # Run base64 encoding: `Base64.encode(secondaryKey) + DELIMITER + 
Base64.encode(recordKey)`.  Base64 does not map to $. So, this gives us a neat 
and standard way to encode. Might not be very efficient for long strings? But, 
base64 is a standard scheme.
 # Escape special characters:  `escapeSpecialChars(secondaryKey) + DELIMITER + 
recordKey`. The keys are readable and preserves the order. This is a custom 
scheme not used in other systems.

 

For null key we can use ASCII 0 as a special rule to encode.
h3. ✔ Pros:
 * {*}No ambiguity or delimiter conflicts{*}: Base64 never uses {{{}${}}}, so 
the separator is guaranteed safe.

 * {*}Standard, well-tested scheme{*}: Easy to encode/decode using libraries.

 * {*}Simple parsing{*}: Can safely use {{split()}} without risk.

 * {*}Robust{*}: Handles any kind of string input, including control characters.

h3. ❌ Cons:
 * {*}Longer keys{*}: Base64 encoding increases size by ~33%.
 * {*}Less readable{*}: The keys are opaque.
 * {*}S3 implications{*}: Longer object keys may slightly impact performance or 
cost (e.g., S3 LIST operations), though this is usually minor.

  was:
 
*{color:#ff0000}For reviewers to properly sign off, please comment with 
"approve solution X" in the comment section.{color}*

 
h1. What's the issue

Here we assume the input is <sec key><separator><record key> and extracts the 
<sec key> part.

 
{code:java}
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  // the payload key is in the format of "secondaryKey$primaryKey"
  // we need to extract the secondary key from the payload key
  checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key 
format for secondary index payload: " + key);
  int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
  return unescapeSpecialChars(key.substring(0, delimiterIndex));
} {code}
The separator is "$".

 

 

This code piece incorrectly assumes <sec key> does not contain any "$" as its 
content. Otherwise for input like

<sec key> = 0$1

the function returns the "0" as sec key which is apparently wrong.

 
h1. Impact

Data correctness and data corruption issue as this function can return garbage 
data.

 
h1. Proposed fix
h2. Sol1: Escaping the string

 
h3. 🔧 Escape Rules:
||Element||Encoding||
|{{$}} in key|{{"$d"}}|
|Delimiter|{{"$s"}}|

 
h3. 🛠 Encoding

{{encodedKey = escape(secondaryKey) + "$s" + escape(recordKey)}}
h3. 🛠 Decoding
 * Find the *first occurrence* of {{"$s"}} (guaranteed to be the delimiter).

 * Substring before → unescape to get secondary key.

 * Substring after → unescape to get record key.

----
h2. ✅ *Functions*
{code:java}
// Escape "$" as "$d"
public static String escape(String key) {
  return key.replace("$", "$d");
}
// Reverse "$d" back to "$"
public static String unescape(String key) {
  return key.replace("$d", "$");
}
// Encode full key
public static String encodeSecondaryIndexKey(String secKey, String recordKey) {
  return escape(secKey) + "$s" + escape(recordKey);
}
// Decode secondary key
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  int idx = key.indexOf("$s");
  return unescape(key.substring(0, idx));
}
// Decode primary key
public static String getRecordKeyFromSecondaryIndexKey(String key) {
  int idx = key.indexOf("$s");
  return unescape(key.substring(idx + 2));
} {code}
----
h2. ✅ *Examples*
h3. Example 1

{{secKey = "$"}}

{{recordKey = "$"}}
*Encoded:* {{"$d$s$d"}}
*Decoded:*
 * secondaryKey = {{"$"}}

 * recordKey = {{"$"}}

h3. Example 2

{{secKey = "ab$cd"}}

{{recordKey = "e$f"}}
*Encoded:* {{"ab$dcd$se$df"}}
*Decoded:*
 * secondaryKey = {{"ab$cd"}}

 * recordKey = {{"e$f"}}

h3. Example 3

{{secKey = "$s"}}

{{recordKey = "$s"}}
*Encoded:* {{""$ds$s$ds"}}
*Decoded:*
 * secondaryKey = {{"$s"}}

 * recordKey = {{"$s"}}

----
h3. ✔ Pros:
 * {*}Minimal overhead{*}: Adds at most 1 extra character per {{$}} in the key.

 * {*}Preserves human readability{*}: You can still identify the structure when 
viewing raw keys in logs or S3.

 * {*}Lexicographical order is preserved{*}: {{$}} and {{$$}} behave 
predictably in ordering.

 * {*}No external dependencies{*}: Just string processing.

h3. ❌ Cons:
 * {*}Custom escaping logic required{*}: More complicated logic to identify the 
_first unescaped_ {{$}} reliably.

 * {*}Risk of future maintenance bugs{*}: It's easy to get escaping wrong 
(e.g., forgetting to unescape or double-escaping).

 * {*}Cannot use naive {{split()}}{*}: Must implement a stateful parser to find 
the first _non-escaped_ delimiter.

h2. Sol 1.2 base 64 based encoding
 # Run base64 encoding: `Base64.encode(secondaryKey) + DELIMITER + recordKey`.  
Base64 does not map to $. So, this gives us a neat and standard way to encode. 
Might not be very efficient for long strings? But, base64 is a standard scheme.
 # Escape special characters:  `escapeSpecialChars(secondaryKey) + DELIMITER + 
recordKey`. The keys are readable and preserves the order. This is a custom 
scheme not used in other systems.

 

Encoding recordkey is not a must have, as we just need to find the first $ sign 
in that case.
h3. ✔ Pros:
 * {*}No ambiguity or delimiter conflicts{*}: Base64 never uses {{{}${}}}, so 
the separator is guaranteed safe.

 * {*}Standard, well-tested scheme{*}: Easy to encode/decode using libraries.

 * {*}Simple parsing{*}: Can safely use {{split()}} without risk.

 * {*}Robust{*}: Handles any kind of string input, including control characters.

h3. ❌ Cons:
 * {*}Longer keys{*}: Base64 encoding increases size by ~33%.

 * {*}Less readable{*}: The keys are opaque.

 * {*}Lexicographical order not preserved{*}: Since Base64 encoding distorts 
ordering, you lose natural sorting behavior.

 * {*}S3 implications{*}: Longer object keys may slightly impact performance or 
cost (e.g., S3 LIST operations), though this is usually minor.

 

Regarding "{*}Lexicographical order not preserved"{*} we need to be careful 
when implementing. {color:#ff0000}*Both reader and writer will assume keys are 
sorted based on their encoded version. It means, we always encode before 
sorting for lookup and write*{color}

 
h2. Sol 2: Don't extract stuff from concatenated string

We either store the 2 parts <sec key> <record key> in separate logical units - 
this is a data layout change

Or all callers only gives the <sec key> so the function never needs to extract 
it from a concatenated string. This can imply a larger code refactor.


> Secondary index bug
> -------------------
>
>                 Key: HUDI-9541
>                 URL: https://issues.apache.org/jira/browse/HUDI-9541
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>    Affects Versions: 1.1.0
>            Reporter: Davis Zhang
>            Assignee: Davis Zhang
>            Priority: Critical
>
>  
> *{color:#ff0000}For reviewers to properly sign off, please comment with 
> "approve solution X" in the comment section.{color}*
>  
> h1. What's the issue
> Here we assume the input is <sec key><separator><record key> and extracts the 
> <sec key> part.
>  
> {code:java}
> public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
>   // the payload key is in the format of "secondaryKey$primaryKey"
>   // we need to extract the secondary key from the payload key
>   checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key 
> format for secondary index payload: " + key);
>   int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
>   return unescapeSpecialChars(key.substring(0, delimiterIndex));
> } {code}
> The separator is "$".
>  
> To make char like "$" to be special safely, we need to escape and unescape 
> the keys.
> Otherwise for corner case like
> <sec key> = 0$1
> the function returns the "0" as sec key which is apparently wrong.
>  
> But what escape/unescape scheme we should use?
> h1. Impact
> Perf, maintainability
>  
> h1. Proposal
> h2. Sol1: Escaping the string (current implmenetation)
>  
> h3. 🔧 Escape Rules:
> ||Element||Encoding||
> |{{$}} in key|{{"$d"}}|
> |Delimiter|{{"$s"}}|
>  
> For null key, just like how we make $ special, we pick ASCII 0 as the special 
> char representing null str and escape ASCII 0 in normal string.
> https://issues.apache.org/jira/browse/HUDI-9543
> h3. 🛠 Encoding
> {{encodedKey = escape(secondaryKey) + "$s" + escape(recordKey)}}
> h3. 🛠 Decoding
>  * Find the *first occurrence* of {{"$s"}} (guaranteed to be the delimiter).
>  * Substring before → unescape to get secondary key.
>  * Substring after → unescape to get record key.
> ----
> h2. ✅ *Functions*
> {code:java}
> // Escape "$" as "$d"
> public static String escape(String key) {
>   return key.replace("$", "$d");
> }
> // Reverse "$d" back to "$"
> public static String unescape(String key) {
>   return key.replace("$d", "$");
> }
> // Encode full key
> public static String encodeSecondaryIndexKey(String secKey, String recordKey) 
> {
>   return escape(secKey) + "$s" + escape(recordKey);
> }
> // Decode secondary key
> public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
>   int idx = key.indexOf("$s");
>   return unescape(key.substring(0, idx));
> }
> // Decode primary key
> public static String getRecordKeyFromSecondaryIndexKey(String key) {
>   int idx = key.indexOf("$s");
>   return unescape(key.substring(idx + 2));
> } {code}
> ----
> h2. ✅ *Examples*
> h3. Example 1
> {{secKey = "$"}}
> {{recordKey = "$"}}
> *Encoded:* {{"$d$s$d"}}
> *Decoded:*
>  * secondaryKey = {{"$"}}
>  * recordKey = {{"$"}}
> h3. Example 2
> {{secKey = "ab$cd"}}
> {{recordKey = "e$f"}}
> *Encoded:* {{"ab$dcd$se$df"}}
> *Decoded:*
>  * secondaryKey = {{"ab$cd"}}
>  * recordKey = {{"e$f"}}
> h3. Example 3
> {{secKey = "$s"}}
> {{recordKey = "$s"}}
> *Encoded:* {{""$ds$s$ds"}}
> *Decoded:*
>  * secondaryKey = {{"$s"}}
>  * recordKey = {{"$s"}}
> ----
> h3. ✔ Pros:
>  * {*}Minimal overhead{*}: Adds at most 1 extra character per {{$}} in the 
> key.
>  * {*}Preserves human readability{*}: You can still identify the structure 
> when viewing raw keys in logs or S3.
>  * {*}No external dependencies{*}: Just string processing.
> h3. ❌ Cons:
>  * {*}Custom escaping logic required{*}: More complicated logic to identify 
> the _first unescaped_ {{$}} reliably.
>  * {*}Risk of future maintenance bugs{*}: It's easy to get escaping wrong 
> (e.g., forgetting to unescape or double-escaping).
>  * {*}Cannot use naive {{split()}}{*}: Must implement a stateful parser to 
> find the first _non-escaped_ delimiter.
> h2. Sol 1.2 base 64 based encoding
>  # Run base64 encoding: `Base64.encode(secondaryKey) + DELIMITER + 
> Base64.encode(recordKey)`.  Base64 does not map to $. So, this gives us a 
> neat and standard way to encode. Might not be very efficient for long 
> strings? But, base64 is a standard scheme.
>  # Escape special characters:  `escapeSpecialChars(secondaryKey) + DELIMITER 
> + recordKey`. The keys are readable and preserves the order. This is a custom 
> scheme not used in other systems.
>  
> For null key we can use ASCII 0 as a special rule to encode.
> h3. ✔ Pros:
>  * {*}No ambiguity or delimiter conflicts{*}: Base64 never uses {{{}${}}}, so 
> the separator is guaranteed safe.
>  * {*}Standard, well-tested scheme{*}: Easy to encode/decode using libraries.
>  * {*}Simple parsing{*}: Can safely use {{split()}} without risk.
>  * {*}Robust{*}: Handles any kind of string input, including control 
> characters.
> h3. ❌ Cons:
>  * {*}Longer keys{*}: Base64 encoding increases size by ~33%.
>  * {*}Less readable{*}: The keys are opaque.
>  * {*}S3 implications{*}: Longer object keys may slightly impact performance 
> or cost (e.g., S3 LIST operations), though this is usually minor.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HUDI-9541) Secondary index bug

Reply via email to