[ 
https://issues.apache.org/jira/browse/HUDI-9541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Zhang updated HUDI-9541:
------------------------------
    Description: 
h1. What's the issue

Here we assume the input is <sec key><separator><record key> and extracts the 
<sec key> part.

 
{code:java}
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  // the payload key is in the format of "secondaryKey$primaryKey"
  // we need to extract the secondary key from the payload key
  checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key 
format for secondary index payload: " + key);
  int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
  return unescapeSpecialChars(key.substring(0, delimiterIndex));
} {code}
The separator is "$".

 

 

This code piece incorrectly assumes <sec key> does not contain any "$" as its 
content. Otherwise for input like

<sec key> = 0$1

the function returns the "0" as sec key which is apparently wrong.

 
h1. Impact

Data correctness and data corruption issue as this function can return garbage 
data.

 
h1. Proposed fix
h2. Sol1: Escaping the string

I
h2. ✅ *Revised Solution: Prefix-Safe Escaping*
h3. 🔧 Escape Rules:
||Element||Encoding||
|{{$}} in key|{{"$d"}}|
|Delimiter|{{"$s"}}|
 
h3. 🛠 Encoding
 
java
CopyEdit
{{encodedKey = escape(secondaryKey) + "$s" + escape(recordKey)}}
h3. 🛠 Decoding
 * Find the *first occurrence* of {{"$s"}} (guaranteed to be the delimiter).

 * Substring before → unescape to get secondary key.

 * Substring after → unescape to get record key.

----
h2. ✅ *Functions*
 
java
CopyEdit
{{// Escape "$" as "$d"public static String escape(String key) {  return 
key.replace("$", "$d");
}// Reverse "$d" back to "$"public static String unescape(String key) {  return 
key.replace("$d", "$");
}// Encode full keypublic static String encodeSecondaryIndexKey(String secKey, 
String recordKey) {  return escape(secKey) + "$s" + escape(recordKey);
}// Decode secondary keypublic static String 
getSecondaryKeyFromSecondaryIndexKey(String key) {  int idx = 
key.indexOf("$s");  return unescape(key.substring(0, idx));
}// Decode primary keypublic static String 
getRecordKeyFromSecondaryIndexKey(String key) {  int idx = key.indexOf("$s");  
return unescape(key.substring(idx + 2));
}}}
----
h2. ✅ *Examples*
h3. Example 1
{{secKey = "$"recordKey = "$"}}
*Encoded:* {{"$d$s$d"}}
*Decoded:*
 * secondaryKey = {{"$"}}

 * recordKey = {{"$"}}

----
h3. Example 2
{{secKey = "ab$cd"recordKey = "e$f"}}
*Encoded:* {{"ab$dcd$s" + "e$df"}}
*Decoded:*
 * secondaryKey = {{"ab$cd"}}

 * recordKey = {{"e$f"}}

----
h3. ✔ Pros:
 * {*}Minimal overhead{*}: Adds at most 1 extra character per {{$}} in the key.

 * {*}Preserves human readability{*}: You can still identify the structure when 
viewing raw keys in logs or S3.

 * {*}Lexicographical order is preserved{*}: {{$}} and {{$$}} behave 
predictably in ordering.

 * {*}No external dependencies{*}: Just string processing.

h3. ❌ Cons:
 * {*}Custom escaping logic required{*}: More complicated logic to identify the 
_first unescaped_ {{$}} reliably.

 * {*}Risk of future maintenance bugs{*}: It's easy to get escaping wrong 
(e.g., forgetting to unescape or double-escaping).

 * {*}Cannot use naive {{split()}}{*}: Must implement a stateful parser to find 
the first _non-escaped_ delimiter.

h2. Sol 1.2 base 64 based encoding
 # Run base64 encoding: `Base64.encode(secondaryKey) + DELIMITER + 
Base64.encode(recordKey)`.  Base64 does not map to $. So, this gives us a neat 
and standard way to encode. Might not be very efficient for long strings? But, 
base64 is a standard scheme.
 # Escape special characters:  `escapeSpecialChars(secondaryKey) + DELIMITER + 
escapeSpecialChars(recordKey)`. The keys are readable and preserves the order. 
This is a custom scheme not used in other systems.

 

Also the base64 encoding of recordkey is not a must have, as we just need to 
find the first $ sign in that case.
h3. ✔ Pros:
 * {*}No ambiguity or delimiter conflicts{*}: Base64 never uses {{{}${}}}, so 
the separator is guaranteed safe.

 * {*}Standard, well-tested scheme{*}: Easy to encode/decode using libraries.

 * {*}Simple parsing{*}: Can safely use {{split()}} without risk.

 * {*}Robust{*}: Handles any kind of string input, including control characters.

h3. ❌ Cons:
 * {*}Longer keys{*}: Base64 encoding increases size by ~33%.

 * {*}Less readable{*}: The keys are opaque.

 * {*}Lexicographical order not preserved{*}: Since Base64 encoding distorts 
ordering, you lose natural sorting behavior.

 * {*}S3 implications{*}: Longer object keys may slightly impact performance or 
cost (e.g., S3 LIST operations), though this is usually minor.

h2. Sol 2: Don't extract stuff from concatenated string

We either store the 2 parts <sec key> <record key> in separate logical units - 
this is a data layout change

Or all callers only gives the <sec key> so the function never needs to extract 
it from a concatenated string. This can imply a larger code refactor.

  was:
h1. What's the issue

Here we assume the input is <sec key><separator><record key> and extracts the 
<sec key> part.

 
{code:java}
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  // the payload key is in the format of "secondaryKey$primaryKey"
  // we need to extract the secondary key from the payload key
  checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key 
format for secondary index payload: " + key);
  int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
  return unescapeSpecialChars(key.substring(0, delimiterIndex));
} {code}
The separator is "$".

 

 

This code piece incorrectly assumes <sec key> does not contain any "$" as its 
content. Otherwise for input like

<sec key> = 0$1

the function returns the "0" as sec key which is apparently wrong.

 
h1. Impact

Data correctness and data corruption issue as this function can return garbage 
data.

 
h1. Proposed fix
h2. Sol1: Escaping the string

If we want to stick to the trick of separating concatenated string with some 
magic separator, we have to escape the $ of <sec Key> to something else. So

 

Write path:

start with sec key 0$1

when write sec key content, we escape $ to be $$. So that there can never be a 
standalone $ after escaping the string.

Write it as

0$$1

When concatenating, it would be

0$$1$<record key>

later if we want to extract, then
 * find the first $ which is not followed by another $.
 * extract the prefix
 * unescape the extracted part back - replacing all $$ to $

Read path follow similar patterns. Do escape before we concatenate and unescape 
when we extract properly.

 

This is a data layout change

 
h3. ✔ Pros:
 * {*}Minimal overhead{*}: Adds at most 1 extra character per {{$}} in the key.

 * {*}Preserves human readability{*}: You can still identify the structure when 
viewing raw keys in logs or S3.

 * {*}Lexicographical order is preserved{*}: {{$}} and {{$$}} behave 
predictably in ordering.

 * {*}No external dependencies{*}: Just string processing.

h3. ❌ Cons:
 * {*}Custom escaping logic required{*}: More complicated logic to identify the 
_first unescaped_ {{$}} reliably.

 * {*}Risk of future maintenance bugs{*}: It's easy to get escaping wrong 
(e.g., forgetting to unescape or double-escaping).

 * {*}Cannot use naive {{split()}}{*}: Must implement a stateful parser to find 
the first _non-escaped_ delimiter.

h2. Sol 1.2 base 64 based encoding
 # Run base64 encoding: `Base64.encode(secondaryKey) + DELIMITER + 
Base64.encode(recordKey)`.  Base64 does not map to $. So, this gives us a neat 
and standard way to encode. Might not be very efficient for long strings? But, 
base64 is a standard scheme.
 # Escape special characters:  `escapeSpecialChars(secondaryKey) + DELIMITER + 
escapeSpecialChars(recordKey)`. The keys are readable and preserves the order. 
This is a custom scheme not used in other systems.

 

Also the base64 encoding of recordkey is not a must have, as we just need to 
find the first $ sign in that case.
h3. ✔ Pros:
 * {*}No ambiguity or delimiter conflicts{*}: Base64 never uses {{{}${}}}, so 
the separator is guaranteed safe.

 * {*}Standard, well-tested scheme{*}: Easy to encode/decode using libraries.

 * {*}Simple parsing{*}: Can safely use {{split()}} without risk.

 * {*}Robust{*}: Handles any kind of string input, including control characters.

h3. ❌ Cons:
 * {*}Longer keys{*}: Base64 encoding increases size by ~33%.

 * {*}Less readable{*}: The keys are opaque.

 * {*}Lexicographical order not preserved{*}: Since Base64 encoding distorts 
ordering, you lose natural sorting behavior.

 * {*}S3 implications{*}: Longer object keys may slightly impact performance or 
cost (e.g., S3 LIST operations), though this is usually minor.

h2. Sol 2: Don't extract stuff from concatenated string

We either store the 2 parts <sec key> <record key> in separate logical units - 
this is a data layout change

Or all callers only gives the <sec key> so the function never needs to extract 
it from a concatenated string. This can imply a larger code refactor.


> Secondary index bug
> -------------------
>
>                 Key: HUDI-9541
>                 URL: https://issues.apache.org/jira/browse/HUDI-9541
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>    Affects Versions: 1.1.0
>            Reporter: Davis Zhang
>            Priority: Critical
>
> h1. What's the issue
> Here we assume the input is <sec key><separator><record key> and extracts the 
> <sec key> part.
>  
> {code:java}
> public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
>   // the payload key is in the format of "secondaryKey$primaryKey"
>   // we need to extract the secondary key from the payload key
>   checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key 
> format for secondary index payload: " + key);
>   int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
>   return unescapeSpecialChars(key.substring(0, delimiterIndex));
> } {code}
> The separator is "$".
>  
>  
> This code piece incorrectly assumes <sec key> does not contain any "$" as its 
> content. Otherwise for input like
> <sec key> = 0$1
> the function returns the "0" as sec key which is apparently wrong.
>  
> h1. Impact
> Data correctness and data corruption issue as this function can return 
> garbage data.
>  
> h1. Proposed fix
> h2. Sol1: Escaping the string
> I
> h2. ✅ *Revised Solution: Prefix-Safe Escaping*
> h3. 🔧 Escape Rules:
> ||Element||Encoding||
> |{{$}} in key|{{"$d"}}|
> |Delimiter|{{"$s"}}|
>  
> h3. 🛠 Encoding
>  
> java
> CopyEdit
> {{encodedKey = escape(secondaryKey) + "$s" + escape(recordKey)}}
> h3. 🛠 Decoding
>  * Find the *first occurrence* of {{"$s"}} (guaranteed to be the delimiter).
>  * Substring before → unescape to get secondary key.
>  * Substring after → unescape to get record key.
> ----
> h2. ✅ *Functions*
>  
> java
> CopyEdit
> {{// Escape "$" as "$d"public static String escape(String key) {  return 
> key.replace("$", "$d");
> }// Reverse "$d" back to "$"public static String unescape(String key) {  
> return key.replace("$d", "$");
> }// Encode full keypublic static String encodeSecondaryIndexKey(String 
> secKey, String recordKey) {  return escape(secKey) + "$s" + escape(recordKey);
> }// Decode secondary keypublic static String 
> getSecondaryKeyFromSecondaryIndexKey(String key) {  int idx = 
> key.indexOf("$s");  return unescape(key.substring(0, idx));
> }// Decode primary keypublic static String 
> getRecordKeyFromSecondaryIndexKey(String key) {  int idx = key.indexOf("$s"); 
>  return unescape(key.substring(idx + 2));
> }}}
> ----
> h2. ✅ *Examples*
> h3. Example 1
> {{secKey = "$"recordKey = "$"}}
> *Encoded:* {{"$d$s$d"}}
> *Decoded:*
>  * secondaryKey = {{"$"}}
>  * recordKey = {{"$"}}
> ----
> h3. Example 2
> {{secKey = "ab$cd"recordKey = "e$f"}}
> *Encoded:* {{"ab$dcd$s" + "e$df"}}
> *Decoded:*
>  * secondaryKey = {{"ab$cd"}}
>  * recordKey = {{"e$f"}}
> ----
> h3. ✔ Pros:
>  * {*}Minimal overhead{*}: Adds at most 1 extra character per {{$}} in the 
> key.
>  * {*}Preserves human readability{*}: You can still identify the structure 
> when viewing raw keys in logs or S3.
>  * {*}Lexicographical order is preserved{*}: {{$}} and {{$$}} behave 
> predictably in ordering.
>  * {*}No external dependencies{*}: Just string processing.
> h3. ❌ Cons:
>  * {*}Custom escaping logic required{*}: More complicated logic to identify 
> the _first unescaped_ {{$}} reliably.
>  * {*}Risk of future maintenance bugs{*}: It's easy to get escaping wrong 
> (e.g., forgetting to unescape or double-escaping).
>  * {*}Cannot use naive {{split()}}{*}: Must implement a stateful parser to 
> find the first _non-escaped_ delimiter.
> h2. Sol 1.2 base 64 based encoding
>  # Run base64 encoding: `Base64.encode(secondaryKey) + DELIMITER + 
> Base64.encode(recordKey)`.  Base64 does not map to $. So, this gives us a 
> neat and standard way to encode. Might not be very efficient for long 
> strings? But, base64 is a standard scheme.
>  # Escape special characters:  `escapeSpecialChars(secondaryKey) + DELIMITER 
> + escapeSpecialChars(recordKey)`. The keys are readable and preserves the 
> order. This is a custom scheme not used in other systems.
>  
> Also the base64 encoding of recordkey is not a must have, as we just need to 
> find the first $ sign in that case.
> h3. ✔ Pros:
>  * {*}No ambiguity or delimiter conflicts{*}: Base64 never uses {{{}${}}}, so 
> the separator is guaranteed safe.
>  * {*}Standard, well-tested scheme{*}: Easy to encode/decode using libraries.
>  * {*}Simple parsing{*}: Can safely use {{split()}} without risk.
>  * {*}Robust{*}: Handles any kind of string input, including control 
> characters.
> h3. ❌ Cons:
>  * {*}Longer keys{*}: Base64 encoding increases size by ~33%.
>  * {*}Less readable{*}: The keys are opaque.
>  * {*}Lexicographical order not preserved{*}: Since Base64 encoding distorts 
> ordering, you lose natural sorting behavior.
>  * {*}S3 implications{*}: Longer object keys may slightly impact performance 
> or cost (e.g., S3 LIST operations), though this is usually minor.
> h2. Sol 2: Don't extract stuff from concatenated string
> We either store the 2 parts <sec key> <record key> in separate logical units 
> - this is a data layout change
> Or all callers only gives the <sec key> so the function never needs to 
> extract it from a concatenated string. This can imply a larger code refactor.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to