[
https://issues.apache.org/jira/browse/HUDI-9541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Davis Zhang updated HUDI-9541:
------------------------------
Affects Version/s: 1.1.0
> Secondary index bug
> -------------------
>
> Key: HUDI-9541
> URL: https://issues.apache.org/jira/browse/HUDI-9541
> Project: Apache Hudi
> Issue Type: Bug
> Components: index
> Affects Versions: 1.1.0
> Reporter: Davis Zhang
> Priority: Critical
>
>
> # What's the issue
> The function below assumes its input has the form
> <sec key><separator><record key> and extracts the <sec key> part.
>
> {code:java}
> public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
>   // the payload key is in the format of "secondaryKey$primaryKey"
>   // we need to extract the secondary key from the payload key
>   checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR),
>       "Invalid key format for secondary index payload: " + key);
>   int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
>   return unescapeSpecialChars(key.substring(0, delimiterIndex));
> }
> {code}
> The separator is "$".
>
>
> This code incorrectly assumes that <sec key> never contains "$".
> For an input such as
> <sec key> = 0$1
> the function returns "0" as the sec key, which is clearly wrong.
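The failure mode described above can be reproduced with a minimal sketch. The class and method names here are hypothetical, and the split-at-first-"$" logic is only an assumed stand-in for what getSecondaryIndexKeySeparatorPosition does, not the actual Hudi implementation:

```java
// Hypothetical stand-in, not Hudi code: assumes the separator position is
// simply the first "$" in the concatenated key.
final class NaiveSplitDemo {
    static final String SEPARATOR = "$";

    // Returns everything before the first separator.
    static String naiveSecondaryKey(String key) {
        return key.substring(0, key.indexOf(SEPARATOR));
    }

    public static void main(String[] args) {
        // Secondary key "0$1" concatenated with an example record key "rk1".
        String key = "0$1" + SEPARATOR + "rk1";
        // The secondary key is truncated at its own embedded "$".
        System.out.println(naiveSecondaryKey(key)); // prints 0, not 0$1
    }
}
```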
>
> # Impact
> Data correctness and data corruption issue.
>
> # Proposed fix
> ## Escaping the string
> If we want to keep separating the concatenated string with a magic
> separator, we have to escape every $ in <sec key>.
>
> Write path:
> * Start with the sec key 0$1.
> * When writing the sec key, escape each $ as $$, so that the escaped sec key
> never contains a standalone $. It is written as
> 0$$1
> * After concatenation with the separator and the record key, it becomes
> 0$$1$<record key>
>
> Read path, when we want to extract the sec key:
> * Find the first $ that is not followed by another $.
> * Extract the prefix before it.
> * Unescape the extracted part by replacing every $$ with $.
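
The write and read steps above can be sketched as follows. This is an illustration under stated assumptions, not the proposed patch: the class and method names are hypothetical, and the scan treats each "$$" pair as one escaped character. It also assumes the record key does not begin with "$"; if it could, doubling the separator alone would be ambiguous and a distinct escape character would be needed.

```java
// Hypothetical sketch of the "$" -> "$$" escaping scheme; names are illustrative.
// Assumption: the record key does not start with "$".
final class EscapedKeySketch {
    static final char SEP = '$';

    // Write path: double every "$" in the secondary key, then append
    // the separator and the record key.
    static String build(String secondaryKey, String recordKey) {
        return secondaryKey.replace("$", "$$") + SEP + recordKey;
    }

    // Read path, step 1: find the first "$" that is not part of a "$$" pair.
    static int separatorPosition(String key) {
        int i = 0;
        while (i < key.length()) {
            if (key.charAt(i) == SEP) {
                if (i + 1 < key.length() && key.charAt(i + 1) == SEP) {
                    i += 2; // escaped "$", skip both characters
                    continue;
                }
                return i; // standalone "$" is the separator
            }
            i++;
        }
        return -1; // no separator found
    }

    // Read path, steps 2 and 3: take the prefix and unescape it.
    static String extractSecondaryKey(String key) {
        int pos = separatorPosition(key);
        if (pos < 0) {
            throw new IllegalStateException("Invalid key format: " + key);
        }
        return key.substring(0, pos).replace("$$", "$");
    }
}
```

For the example in this issue, build("0$1", "rk1") yields 0$$1$rk1, and extractSecondaryKey on that value recovers 0$1.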
>
> ## Don't extract stuff from concatenated string
> Either we store the two parts, <sec key> and <record key>, in separate
> logical units, or all callers pass only the <sec key>, so the function never
> needs to extract it from a concatenated string. This may require a larger
> code refactor.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)