[ 
https://issues.apache.org/jira/browse/HUDI-9541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Zhang updated HUDI-9541:
------------------------------
    Status: In Progress  (was: Open)

> Secondary index bug
> -------------------
>
>                 Key: HUDI-9541
>                 URL: https://issues.apache.org/jira/browse/HUDI-9541
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>    Affects Versions: 1.1.0
>            Reporter: Davis Zhang
>            Assignee: Davis Zhang
>            Priority: Critical
>
>  
> *{color:#ff0000}For reviewers to properly sign off, please comment with 
> "approve solution X" in the comment section.{color}*
>  
> h1. What's the issue
> Here we assume the input is <sec key><separator><record key> and extracts the 
> <sec key> part.
>  
> {code:java}
> public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
>   // the payload key is in the format of "secondaryKey$primaryKey"
>   // we need to extract the secondary key from the payload key
>   checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key 
> format for secondary index payload: " + key);
>   int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
>   return unescapeSpecialChars(key.substring(0, delimiterIndex));
> } {code}
> The separator is "$".
>  
>  
> This code piece incorrectly assumes <sec key> does not contain any "$" as its 
> content. Otherwise for input like
> <sec key> = 0$1
> the function returns the "0" as sec key which is apparently wrong.
>  
> h1. Impact
> Data correctness and data corruption issue as this function can return 
> garbage data.
>  
> h1. Proposed fix
> h2. Sol1: Escaping the string
>  
> h3. 🔧 Escape Rules:
> ||Element||Encoding||
> |{{$}} in key|{{"$d"}}|
> |Delimiter|{{"$s"}}|
>  
> h3. 🛠 Encoding
> {{encodedKey = escape(secondaryKey) + "$s" + escape(recordKey)}}
> h3. 🛠 Decoding
>  * Find the *first occurrence* of {{"$s"}} (guaranteed to be the delimiter).
>  * Substring before → unescape to get secondary key.
>  * Substring after → unescape to get record key.
> ----
> h2. ✅ *Functions*
> {code:java}
> // Escape "$" as "$d"
> public static String escape(String key) {
>   return key.replace("$", "$d");
> }
> // Reverse "$d" back to "$"
> public static String unescape(String key) {
>   return key.replace("$d", "$");
> }
> // Encode full key
> public static String encodeSecondaryIndexKey(String secKey, String recordKey) 
> {
>   return escape(secKey) + "$s" + escape(recordKey);
> }
> // Decode secondary key
> public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
>   int idx = key.indexOf("$s");
>   return unescape(key.substring(0, idx));
> }
> // Decode primary key
> public static String getRecordKeyFromSecondaryIndexKey(String key) {
>   int idx = key.indexOf("$s");
>   return unescape(key.substring(idx + 2));
> } {code}
> ----
> h2. ✅ *Examples*
> h3. Example 1
> {{secKey = "$"}}
> {{recordKey = "$"}}
> *Encoded:* {{"$d$s$d"}}
> *Decoded:*
>  * secondaryKey = {{"$"}}
>  * recordKey = {{"$"}}
> h3. Example 2
> {{secKey = "ab$cd"}}
> {{recordKey = "e$f"}}
> *Encoded:* {{"ab$dcd$se$df"}}
> *Decoded:*
>  * secondaryKey = {{"ab$cd"}}
>  * recordKey = {{"e$f"}}
> h3. Example 3
> {{secKey = "$s"}}
> {{recordKey = "$s"}}
> *Encoded:* {{""$ds$s$ds"}}
> *Decoded:*
>  * secondaryKey = {{"$s"}}
>  * recordKey = {{"$s"}}
> ----
> h3. ✔ Pros:
>  * {*}Minimal overhead{*}: Adds at most 1 extra character per {{$}} in the 
> key.
>  * {*}Preserves human readability{*}: You can still identify the structure 
> when viewing raw keys in logs or S3.
>  * {*}Lexicographical order is preserved{*}: {{$}} and {{$$}} behave 
> predictably in ordering.
>  * {*}No external dependencies{*}: Just string processing.
> h3. ❌ Cons:
>  * {*}Custom escaping logic required{*}: More complicated logic to identify 
> the _first unescaped_ {{$}} reliably.
>  * {*}Risk of future maintenance bugs{*}: It's easy to get escaping wrong 
> (e.g., forgetting to unescape or double-escaping).
>  * {*}Cannot use naive {{split()}}{*}: Must implement a stateful parser to 
> find the first _non-escaped_ delimiter.
> h2. Sol 1.2 base 64 based encoding
>  # Run base64 encoding: `Base64.encode(secondaryKey) + DELIMITER + 
> recordKey`.  Base64 does not map to $. So, this gives us a neat and standard 
> way to encode. Might not be very efficient for long strings? But, base64 is a 
> standard scheme.
>  # Escape special characters:  `escapeSpecialChars(secondaryKey) + DELIMITER 
> + recordKey`. The keys are readable and preserves the order. This is a custom 
> scheme not used in other systems.
>  
> Encoding recordkey is not a must have, as we just need to find the first $ 
> sign in that case.
> h3. ✔ Pros:
>  * {*}No ambiguity or delimiter conflicts{*}: Base64 never uses {{{}${}}}, so 
> the separator is guaranteed safe.
>  * {*}Standard, well-tested scheme{*}: Easy to encode/decode using libraries.
>  * {*}Simple parsing{*}: Can safely use {{split()}} without risk.
>  * {*}Robust{*}: Handles any kind of string input, including control 
> characters.
> h3. ❌ Cons:
>  * {*}Longer keys{*}: Base64 encoding increases size by ~33%.
>  * {*}Less readable{*}: The keys are opaque.
>  * {*}Lexicographical order not preserved{*}: Since Base64 encoding distorts 
> ordering, you lose natural sorting behavior.
>  * {*}S3 implications{*}: Longer object keys may slightly impact performance 
> or cost (e.g., S3 LIST operations), though this is usually minor.
>  
> Regarding "{*}Lexicographical order not preserved"{*} we need to be careful 
> when implementing. {color:#ff0000}*Both reader and writer will assume keys 
> are sorted based on their encoded version. It means, we always encode before 
> sorting for lookup and write*{color}
>  
> h2. Sol 2: Don't extract stuff from concatenated string
> We either store the 2 parts <sec key> <record key> in separate logical units 
> - this is a data layout change
> Or all callers only gives the <sec key> so the function never needs to 
> extract it from a concatenated string. This can imply a larger code refactor.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to