[jira] [Updated] (HUDI-9541) Secondary index bug

Davis Zhang (Jira) Fri, 20 Jun 2025 13:36:06 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-9541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Davis Zhang updated HUDI-9541:
------------------------------
    Description: 
h1. What's the issue

Here we assume the input is <sec key><separator><record key> and extracts the 
<sec key> part.

 
{code:java}
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  // the payload key is in the format of "secondaryKey$primaryKey"
  // we need to extract the secondary key from the payload key
  checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key 
format for secondary index payload: " + key);
  int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
  return unescapeSpecialChars(key.substring(0, delimiterIndex));
} {code}
The separator is "$".

 

 

This code piece incorrectly assumes <sec key> does not contain any "$" as its 
content. Otherwise for input like

<sec key> = 0$1

the function returns the "0" as sec key which is apparently wrong.

 
h1. Impact

Data correctness and data corruption issue as this function can return garbage 
data.

 
h1. Proposed fix
h2. Sol1: Escaping the string

 
h3. 🔧 Escape Rules:
||Element||Encoding||
|{{$}} in key|{{"$d"}}|
|Delimiter|{{"$s"}}|

 
h3. 🛠 Encoding

{{encodedKey = escape(secondaryKey) + "$s" + escape(recordKey)}}
h3. 🛠 Decoding
 * Find the *first occurrence* of {{"$s"}} (guaranteed to be the delimiter).

 * Substring before → unescape to get secondary key.

 * Substring after → unescape to get record key.

----
h2. ✅ *Functions*
{code:java}
// Escape "$" as "$d"
public static String escape(String key) {
  return key.replace("$", "$d");
}
// Reverse "$d" back to "$"
public static String unescape(String key) {
  return key.replace("$d", "$");
}
// Encode full key
public static String encodeSecondaryIndexKey(String secKey, String recordKey) {
  return escape(secKey) + "$s" + escape(recordKey);
}
// Decode secondary key
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  int idx = key.indexOf("$s");
  return unescape(key.substring(0, idx));
}
// Decode primary key
public static String getRecordKeyFromSecondaryIndexKey(String key) {
  int idx = key.indexOf("$s");
  return unescape(key.substring(idx + 2));
} {code}
----
h2. ✅ *Examples*
h3. Example 1

{{secKey = "$"}}

{{recordKey = "$"}}
*Encoded:* {{"$d$s$d"}}
*Decoded:*
 * secondaryKey = {{"$"}}

 * recordKey = {{"$"}}

h3. Example 2

{{secKey = "ab$cd"}}

{{recordKey = "e$f"}}
*Encoded:* {{"ab$dcd$se$df"}}
*Decoded:*
 * secondaryKey = {{"ab$cd"}}

 * recordKey = {{"e$f"}}

h3. Example 3

{{secKey = "$s"}}

{{recordKey = "$s"}}
*Encoded:* {{""$ds$s$ds"}}
*Decoded:*
 * secondaryKey = {{"$s"}}

 * recordKey = {{"$s"}}

----
h3. ✔ Pros:
 * {*}Minimal overhead{*}: Adds at most 1 extra character per {{$}} in the key.

 * {*}Preserves human readability{*}: You can still identify the structure when 
viewing raw keys in logs or S3.

 * {*}Lexicographical order is preserved{*}: {{$}} and {{$$}} behave 
predictably in ordering.

 * {*}No external dependencies{*}: Just string processing.

h3. ❌ Cons:
 * {*}Custom escaping logic required{*}: More complicated logic to identify the 
_first unescaped_ {{$}} reliably.

 * {*}Risk of future maintenance bugs{*}: It's easy to get escaping wrong 
(e.g., forgetting to unescape or double-escaping).

 * {*}Cannot use naive {{split()}}{*}: Must implement a stateful parser to find 
the first _non-escaped_ delimiter.

h2. Sol 1.2 base 64 based encoding
 # Run base64 encoding: `Base64.encode(secondaryKey) + DELIMITER + 
Base64.encode(recordKey)`.  Base64 does not map to $. So, this gives us a neat 
and standard way to encode. Might not be very efficient for long strings? But, 
base64 is a standard scheme.
 # Escape special characters:  `escapeSpecialChars(secondaryKey) + DELIMITER + 
escapeSpecialChars(recordKey)`. The keys are readable and preserves the order. 
This is a custom scheme not used in other systems.

 

Also the base64 encoding of recordkey is not a must have, as we just need to 
find the first $ sign in that case.
h3. ✔ Pros:
 * {*}No ambiguity or delimiter conflicts{*}: Base64 never uses {{{}${}}}, so 
the separator is guaranteed safe.

 * {*}Standard, well-tested scheme{*}: Easy to encode/decode using libraries.

 * {*}Simple parsing{*}: Can safely use {{split()}} without risk.

 * {*}Robust{*}: Handles any kind of string input, including control characters.

h3. ❌ Cons:
 * {*}Longer keys{*}: Base64 encoding increases size by ~33%.

 * {*}Less readable{*}: The keys are opaque.

 * {*}Lexicographical order not preserved{*}: Since Base64 encoding distorts 
ordering, you lose natural sorting behavior.

 * {*}S3 implications{*}: Longer object keys may slightly impact performance or 
cost (e.g., S3 LIST operations), though this is usually minor.

h2. Sol 2: Don't extract stuff from concatenated string

We either store the 2 parts <sec key> <record key> in separate logical units - 
this is a data layout change

Or all callers only gives the <sec key> so the function never needs to extract 
it from a concatenated string. This can imply a larger code refactor.

  was:
h1. What's the issue

Here we assume the input is <sec key><separator><record key> and extracts the 
<sec key> part.

 
{code:java}
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  // the payload key is in the format of "secondaryKey$primaryKey"
  // we need to extract the secondary key from the payload key
  checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key 
format for secondary index payload: " + key);
  int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
  return unescapeSpecialChars(key.substring(0, delimiterIndex));
} {code}
The separator is "$".

 

 

This code piece incorrectly assumes <sec key> does not contain any "$" as its 
content. Otherwise for input like

<sec key> = 0$1

the function returns the "0" as sec key which is apparently wrong.

 
h1. Impact

Data correctness and data corruption issue as this function can return garbage 
data.

 
h1. Proposed fix
h2. Sol1: Escaping the string

I
h2. ✅ *Revised Solution: Prefix-Safe Escaping*
h3. 🔧 Escape Rules:
||Element||Encoding||
|{{$}} in key|{{"$d"}}|
|Delimiter|{{"$s"}}|

 
h3. 🛠 Encoding


{{encodedKey = escape(secondaryKey) + "$s" + escape(recordKey)}}
h3. 🛠 Decoding
 * Find the *first occurrence* of {{"$s"}} (guaranteed to be the delimiter).

 * Substring before → unescape to get secondary key.

 * Substring after → unescape to get record key.

----
h2. ✅ *Functions*
{code:java}
// Escape "$" as "$d"
public static String escape(String key) {
  return key.replace("$", "$d");
}// Reverse "$d" back to "$"
public static String unescape(String key) {
  return key.replace("$d", "$");
}// Encode full key
public static String encodeSecondaryIndexKey(String secKey, String recordKey) {
  return escape(secKey) + "$s" + escape(recordKey);
}// Decode secondary key
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  int idx = key.indexOf("$s");
  return unescape(key.substring(0, idx));
}// Decode primary key
public static String getRecordKeyFromSecondaryIndexKey(String key) {
  int idx = key.indexOf("$s");
  return unescape(key.substring(idx + 2));
} {code}
----
h2. ✅ *Examples*
h3. Example 1

{{secKey = "$"recordKey = "$"}}
*Encoded:* {{"$d$s$d"}}
*Decoded:*
 * secondaryKey = {{"$"}}

 * recordKey = {{"$"}}

----
h3. Example 2

{{secKey = "ab$cd"recordKey = "e$f"}}
*Encoded:* {{"ab$dcd$s" + "e$df"}}
*Decoded:*
 * secondaryKey = {{"ab$cd"}}

 * recordKey = {{"e$f"}}

----
h3. ✔ Pros:
 * {*}Minimal overhead{*}: Adds at most 1 extra character per {{$}} in the key.

 * {*}Preserves human readability{*}: You can still identify the structure when 
viewing raw keys in logs or S3.

 * {*}Lexicographical order is preserved{*}: {{$}} and {{$$}} behave 
predictably in ordering.

 * {*}No external dependencies{*}: Just string processing.

h3. ❌ Cons:
 * {*}Custom escaping logic required{*}: More complicated logic to identify the 
_first unescaped_ {{$}} reliably.

 * {*}Risk of future maintenance bugs{*}: It's easy to get escaping wrong 
(e.g., forgetting to unescape or double-escaping).

 * {*}Cannot use naive {{split()}}{*}: Must implement a stateful parser to find 
the first _non-escaped_ delimiter.

h2. Sol 1.2 base 64 based encoding
 # Run base64 encoding: `Base64.encode(secondaryKey) + DELIMITER + 
Base64.encode(recordKey)`.  Base64 does not map to $. So, this gives us a neat 
and standard way to encode. Might not be very efficient for long strings? But, 
base64 is a standard scheme.
 # Escape special characters:  `escapeSpecialChars(secondaryKey) + DELIMITER + 
escapeSpecialChars(recordKey)`. The keys are readable and preserves the order. 
This is a custom scheme not used in other systems.

 

Also the base64 encoding of recordkey is not a must have, as we just need to 
find the first $ sign in that case.
h3. ✔ Pros:
 * {*}No ambiguity or delimiter conflicts{*}: Base64 never uses {{{}${}}}, so 
the separator is guaranteed safe.

 * {*}Standard, well-tested scheme{*}: Easy to encode/decode using libraries.

 * {*}Simple parsing{*}: Can safely use {{split()}} without risk.

 * {*}Robust{*}: Handles any kind of string input, including control characters.

h3. ❌ Cons:
 * {*}Longer keys{*}: Base64 encoding increases size by ~33%.

 * {*}Less readable{*}: The keys are opaque.

 * {*}Lexicographical order not preserved{*}: Since Base64 encoding distorts 
ordering, you lose natural sorting behavior.

 * {*}S3 implications{*}: Longer object keys may slightly impact performance or 
cost (e.g., S3 LIST operations), though this is usually minor.

h2. Sol 2: Don't extract stuff from concatenated string

We either store the 2 parts <sec key> <record key> in separate logical units - 
this is a data layout change

Or all callers only gives the <sec key> so the function never needs to extract 
it from a concatenated string. This can imply a larger code refactor.


> Secondary index bug
> -------------------
>
>                 Key: HUDI-9541
>                 URL: https://issues.apache.org/jira/browse/HUDI-9541
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>    Affects Versions: 1.1.0
>            Reporter: Davis Zhang
>            Priority: Critical
>
> h1. What's the issue
> Here we assume the input is <sec key><separator><record key> and extracts the 
> <sec key> part.
>  
> {code:java}
> public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
>   // the payload key is in the format of "secondaryKey$primaryKey"
>   // we need to extract the secondary key from the payload key
>   checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key 
> format for secondary index payload: " + key);
>   int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
>   return unescapeSpecialChars(key.substring(0, delimiterIndex));
> } {code}
> The separator is "$".
>  
>  
> This code piece incorrectly assumes <sec key> does not contain any "$" as its 
> content. Otherwise for input like
> <sec key> = 0$1
> the function returns the "0" as sec key which is apparently wrong.
>  
> h1. Impact
> Data correctness and data corruption issue as this function can return 
> garbage data.
>  
> h1. Proposed fix
> h2. Sol1: Escaping the string
>  
> h3. 🔧 Escape Rules:
> ||Element||Encoding||
> |{{$}} in key|{{"$d"}}|
> |Delimiter|{{"$s"}}|
>  
> h3. 🛠 Encoding
> {{encodedKey = escape(secondaryKey) + "$s" + escape(recordKey)}}
> h3. 🛠 Decoding
>  * Find the *first occurrence* of {{"$s"}} (guaranteed to be the delimiter).
>  * Substring before → unescape to get secondary key.
>  * Substring after → unescape to get record key.
> ----
> h2. ✅ *Functions*
> {code:java}
> // Escape "$" as "$d"
> public static String escape(String key) {
>   return key.replace("$", "$d");
> }
> // Reverse "$d" back to "$"
> public static String unescape(String key) {
>   return key.replace("$d", "$");
> }
> // Encode full key
> public static String encodeSecondaryIndexKey(String secKey, String recordKey) 
> {
>   return escape(secKey) + "$s" + escape(recordKey);
> }
> // Decode secondary key
> public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
>   int idx = key.indexOf("$s");
>   return unescape(key.substring(0, idx));
> }
> // Decode primary key
> public static String getRecordKeyFromSecondaryIndexKey(String key) {
>   int idx = key.indexOf("$s");
>   return unescape(key.substring(idx + 2));
> } {code}
> ----
> h2. ✅ *Examples*
> h3. Example 1
> {{secKey = "$"}}
> {{recordKey = "$"}}
> *Encoded:* {{"$d$s$d"}}
> *Decoded:*
>  * secondaryKey = {{"$"}}
>  * recordKey = {{"$"}}
> h3. Example 2
> {{secKey = "ab$cd"}}
> {{recordKey = "e$f"}}
> *Encoded:* {{"ab$dcd$se$df"}}
> *Decoded:*
>  * secondaryKey = {{"ab$cd"}}
>  * recordKey = {{"e$f"}}
> h3. Example 3
> {{secKey = "$s"}}
> {{recordKey = "$s"}}
> *Encoded:* {{""$ds$s$ds"}}
> *Decoded:*
>  * secondaryKey = {{"$s"}}
>  * recordKey = {{"$s"}}
> ----
> h3. ✔ Pros:
>  * {*}Minimal overhead{*}: Adds at most 1 extra character per {{$}} in the 
> key.
>  * {*}Preserves human readability{*}: You can still identify the structure 
> when viewing raw keys in logs or S3.
>  * {*}Lexicographical order is preserved{*}: {{$}} and {{$$}} behave 
> predictably in ordering.
>  * {*}No external dependencies{*}: Just string processing.
> h3. ❌ Cons:
>  * {*}Custom escaping logic required{*}: More complicated logic to identify 
> the _first unescaped_ {{$}} reliably.
>  * {*}Risk of future maintenance bugs{*}: It's easy to get escaping wrong 
> (e.g., forgetting to unescape or double-escaping).
>  * {*}Cannot use naive {{split()}}{*}: Must implement a stateful parser to 
> find the first _non-escaped_ delimiter.
> h2. Sol 1.2 base 64 based encoding
>  # Run base64 encoding: `Base64.encode(secondaryKey) + DELIMITER + 
> Base64.encode(recordKey)`.  Base64 does not map to $. So, this gives us a 
> neat and standard way to encode. Might not be very efficient for long 
> strings? But, base64 is a standard scheme.
>  # Escape special characters:  `escapeSpecialChars(secondaryKey) + DELIMITER 
> + escapeSpecialChars(recordKey)`. The keys are readable and preserves the 
> order. This is a custom scheme not used in other systems.
>  
> Also the base64 encoding of recordkey is not a must have, as we just need to 
> find the first $ sign in that case.
> h3. ✔ Pros:
>  * {*}No ambiguity or delimiter conflicts{*}: Base64 never uses {{{}${}}}, so 
> the separator is guaranteed safe.
>  * {*}Standard, well-tested scheme{*}: Easy to encode/decode using libraries.
>  * {*}Simple parsing{*}: Can safely use {{split()}} without risk.
>  * {*}Robust{*}: Handles any kind of string input, including control 
> characters.
> h3. ❌ Cons:
>  * {*}Longer keys{*}: Base64 encoding increases size by ~33%.
>  * {*}Less readable{*}: The keys are opaque.
>  * {*}Lexicographical order not preserved{*}: Since Base64 encoding distorts 
> ordering, you lose natural sorting behavior.
>  * {*}S3 implications{*}: Longer object keys may slightly impact performance 
> or cost (e.g., S3 LIST operations), though this is usually minor.
> h2. Sol 2: Don't extract stuff from concatenated string
> We either store the 2 parts <sec key> <record key> in separate logical units 
> - this is a data layout change
> Or all callers only gives the <sec key> so the function never needs to 
> extract it from a concatenated string. This can imply a larger code refactor.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HUDI-9541) Secondary index bug

Reply via email to