[
https://issues.apache.org/jira/browse/HUDI-9541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Davis Zhang updated HUDI-9541:
------------------------------
Reviewers: sivabalan narayanan, Y Ethan Guo
> Secondary index bug
> -------------------
>
> Key: HUDI-9541
> URL: https://issues.apache.org/jira/browse/HUDI-9541
> Project: Apache Hudi
> Issue Type: Bug
> Components: index
> Affects Versions: 1.1.0
> Reporter: Davis Zhang
> Assignee: Davis Zhang
> Priority: Critical
>
>
> *{color:#ff0000}For reviewers to properly sign off, please comment with
> "approve solution X" in the comment section.{color}*
>
> h1. What's the issue
> Here we assume the input is <sec key><separator><record key> and extracts the
> <sec key> part.
>
> {code:java}
> public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
> // the payload key is in the format of "secondaryKey$primaryKey"
> // we need to extract the secondary key from the payload key
> checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key
> format for secondary index payload: " + key);
> int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
> return unescapeSpecialChars(key.substring(0, delimiterIndex));
> } {code}
> The separator is "$".
>
>
> This code piece incorrectly assumes <sec key> does not contain any "$" as its
> content. Otherwise for input like
> <sec key> = 0$1
> the function returns the "0" as sec key which is apparently wrong.
>
> h1. Impact
> Data correctness and data corruption issue as this function can return
> garbage data.
>
> h1. Proposed fix
> h2. Sol1: Escaping the string
>
> h3. 🔧 Escape Rules:
> ||Element||Encoding||
> |{{$}} in key|{{"$d"}}|
> |Delimiter|{{"$s"}}|
>
> h3. 🛠 Encoding
> {{encodedKey = escape(secondaryKey) + "$s" + escape(recordKey)}}
> h3. 🛠 Decoding
> * Find the *first occurrence* of {{"$s"}} (guaranteed to be the delimiter).
> * Substring before → unescape to get secondary key.
> * Substring after → unescape to get record key.
> ----
> h2. ✅ *Functions*
> {code:java}
> // Escape "$" as "$d"
> public static String escape(String key) {
> return key.replace("$", "$d");
> }
> // Reverse "$d" back to "$"
> public static String unescape(String key) {
> return key.replace("$d", "$");
> }
> // Encode full key
> public static String encodeSecondaryIndexKey(String secKey, String recordKey)
> {
> return escape(secKey) + "$s" + escape(recordKey);
> }
> // Decode secondary key
> public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
> int idx = key.indexOf("$s");
> return unescape(key.substring(0, idx));
> }
> // Decode primary key
> public static String getRecordKeyFromSecondaryIndexKey(String key) {
> int idx = key.indexOf("$s");
> return unescape(key.substring(idx + 2));
> } {code}
> ----
> h2. ✅ *Examples*
> h3. Example 1
> {{secKey = "$"}}
> {{recordKey = "$"}}
> *Encoded:* {{"$d$s$d"}}
> *Decoded:*
> * secondaryKey = {{"$"}}
> * recordKey = {{"$"}}
> h3. Example 2
> {{secKey = "ab$cd"}}
> {{recordKey = "e$f"}}
> *Encoded:* {{"ab$dcd$se$df"}}
> *Decoded:*
> * secondaryKey = {{"ab$cd"}}
> * recordKey = {{"e$f"}}
> h3. Example 3
> {{secKey = "$s"}}
> {{recordKey = "$s"}}
> *Encoded:* {{""$ds$s$ds"}}
> *Decoded:*
> * secondaryKey = {{"$s"}}
> * recordKey = {{"$s"}}
> ----
> h3. ✔ Pros:
> * {*}Minimal overhead{*}: Adds at most 1 extra character per {{$}} in the
> key.
> * {*}Preserves human readability{*}: You can still identify the structure
> when viewing raw keys in logs or S3.
> * {*}Lexicographical order is preserved{*}: {{$}} and {{$$}} behave
> predictably in ordering.
> * {*}No external dependencies{*}: Just string processing.
> h3. ❌ Cons:
> * {*}Custom escaping logic required{*}: More complicated logic to identify
> the _first unescaped_ {{$}} reliably.
> * {*}Risk of future maintenance bugs{*}: It's easy to get escaping wrong
> (e.g., forgetting to unescape or double-escaping).
> * {*}Cannot use naive {{split()}}{*}: Must implement a stateful parser to
> find the first _non-escaped_ delimiter.
> h2. Sol 1.2 base 64 based encoding
> # Run base64 encoding: `Base64.encode(secondaryKey) + DELIMITER +
> recordKey`. Base64 does not map to $. So, this gives us a neat and standard
> way to encode. Might not be very efficient for long strings? But, base64 is a
> standard scheme.
> # Escape special characters: `escapeSpecialChars(secondaryKey) + DELIMITER
> + recordKey`. The keys are readable and preserves the order. This is a custom
> scheme not used in other systems.
>
> Encoding recordkey is not a must have, as we just need to find the first $
> sign in that case.
> h3. ✔ Pros:
> * {*}No ambiguity or delimiter conflicts{*}: Base64 never uses {{{}${}}}, so
> the separator is guaranteed safe.
> * {*}Standard, well-tested scheme{*}: Easy to encode/decode using libraries.
> * {*}Simple parsing{*}: Can safely use {{split()}} without risk.
> * {*}Robust{*}: Handles any kind of string input, including control
> characters.
> h3. ❌ Cons:
> * {*}Longer keys{*}: Base64 encoding increases size by ~33%.
> * {*}Less readable{*}: The keys are opaque.
> * {*}Lexicographical order not preserved{*}: Since Base64 encoding distorts
> ordering, you lose natural sorting behavior.
> * {*}S3 implications{*}: Longer object keys may slightly impact performance
> or cost (e.g., S3 LIST operations), though this is usually minor.
>
> Regarding "{*}Lexicographical order not preserved"{*} we need to be careful
> when implementing. {color:#ff0000}*Both reader and writer will assume keys
> are sorted based on their encoded version. It means, we always encode before
> sorting for lookup and write*{color}
>
> h2. Sol 2: Don't extract stuff from concatenated string
> We either store the 2 parts <sec key> <record key> in separate logical units
> - this is a data layout change
> Or all callers only gives the <sec key> so the function never needs to
> extract it from a concatenated string. This can imply a larger code refactor.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)