[ 
https://issues.apache.org/jira/browse/HUDI-9541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Zhang updated HUDI-9541:
------------------------------
    Affects Version/s: 1.1.0

> Secondary index bug
> -------------------
>
>                 Key: HUDI-9541
>                 URL: https://issues.apache.org/jira/browse/HUDI-9541
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>    Affects Versions: 1.1.0
>            Reporter: Davis Zhang
>            Priority: Critical
>
>  
> # What's the issue
> The method below assumes its input has the form
> <sec key><separator><record key> and extracts the <sec key> part.
>  
> {code:java}
> public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
>   // the payload key is in the format of "secondaryKey$primaryKey"
>   // we need to extract the secondary key from the payload key
>   checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR),
>       "Invalid key format for secondary index payload: " + key);
>   int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
>   return unescapeSpecialChars(key.substring(0, delimiterIndex));
> }
> {code}
> The separator is "$".
>  
>  
> This code incorrectly assumes that <sec key> never contains "$". For an
> input such as
> <sec key> = 0$1
> the function returns "0" as the sec key, which is clearly wrong.
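
The failure described above can be reproduced in isolation. The sketch below is not the Hudi code: it assumes, for illustration only, that the separator position is the first occurrence of "$" and that the write side concatenates the two keys without escaping. The class and method names are hypothetical.

```java
// Hypothetical stand-in for the extraction logic described in the issue.
final class NaiveSplitDemo {
    // Separator as stated in the issue text.
    static final String SEPARATOR = "$";

    // Assumption: keys are concatenated with no escaping of the secondary key.
    static String concat(String secondaryKey, String recordKey) {
        return secondaryKey + SEPARATOR + recordKey;
    }

    // Assumption: the secondary key is everything before the first "$".
    static String naiveSecondaryKey(String indexKey) {
        int pos = indexKey.indexOf(SEPARATOR);
        if (pos < 0) {
            throw new IllegalStateException("Invalid key format: " + indexKey);
        }
        return indexKey.substring(0, pos);
    }
}
```

Under these assumptions, NaiveSplitDemo.naiveSecondaryKey(NaiveSplitDemo.concat("0$1", "rk1")) yields "0" instead of "0$1", while a secondary key without "$" round-trips correctly.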
>  
> # Impact
> Data correctness and data corruption issue.
>  
> # Proposed fix
> ## Escaping the string
> If we want to keep separating the concatenated string with a magic
> separator, we have to escape every $ inside <sec key>.
>  
> Write path:
>  * Start with sec key 0$1.
>  * When writing the sec key, escape each $ as $$, so that the escaped sec
>    key never contains a standalone $: 0$$1
>  * Concatenate it with the separator and the record key: 0$$1$<record key>
>  
> Read path (extraction):
>  * Scan from the left, treating each $$ pair as one escaped $; the first $
>    that is not part of such a pair is the separator.
>  * Extract the prefix before the separator.
>  * Unescape the prefix by replacing every $$ with $.
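
The write and read steps above can be sketched as follows. This is a minimal illustration, not the proposed patch: the class and method names are hypothetical, and it adds one assumption that the issue text does not state, namely that the record key does not start with "$". Without that restriction, the boundary between an escaped "$$" and the separator would be ambiguous, so the sketch rejects such record keys.

```java
// Hypothetical sketch of the "$" -> "$$" escaping scheme described above.
final class DollarEscapingSketch {
    private static final char SEP = '$';

    // Write path: double every "$" so the escaped key has no standalone "$".
    static String escape(String secondaryKey) {
        return secondaryKey.replace("$", "$$");
    }

    // Read path: collapse every "$$" pair back into a single "$".
    static String unescape(String escaped) {
        return escaped.replace("$$", "$");
    }

    static String buildIndexKey(String secondaryKey, String recordKey) {
        // Assumption (not in the issue): a record key starting with "$"
        // would merge with the separator, so it is rejected here.
        if (!recordKey.isEmpty() && recordKey.charAt(0) == SEP) {
            throw new IllegalArgumentException("Record key must not start with $");
        }
        return escape(secondaryKey) + SEP + recordKey;
    }

    // Scan left to right, skipping "$$" pairs; the first unpaired "$" is the separator.
    static int separatorPosition(String indexKey) {
        int i = 0;
        while (i < indexKey.length()) {
            if (indexKey.charAt(i) == SEP) {
                boolean escapedPair =
                    i + 1 < indexKey.length() && indexKey.charAt(i + 1) == SEP;
                if (!escapedPair) {
                    return i;
                }
                i += 2; // skip both characters of the escaped pair
            } else {
                i++;
            }
        }
        throw new IllegalStateException("No separator found in: " + indexKey);
    }

    static String extractSecondaryKey(String indexKey) {
        return unescape(indexKey.substring(0, separatorPosition(indexKey)));
    }
}
```

Note that the scan consumes "$$" pairs two characters at a time. Checking only whether a "$" is followed by another "$" would misread the second "$" of a pair as the separator.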
>  
> ## Don't extract stuff from concatenated string
> Either we store the two parts, <sec key> and <record key>, in separate
> logical units, or all callers pass only the <sec key>, so the function never
> needs to extract it from a concatenated string. This may require a larger
> code refactor.
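
As a rough illustration of the first alternative, the two parts could travel together in a small value type instead of one concatenated string. The class below is hypothetical and says nothing about how Hudi would lay this out on disk; it only shows that no separator or escaping is needed once the keys are held in separate fields.

```java
// Hypothetical holder that keeps the two keys as separate fields.
final class SecondaryIndexEntry {
    private final String secondaryKey;
    private final String recordKey;

    SecondaryIndexEntry(String secondaryKey, String recordKey) {
        if (secondaryKey == null || recordKey == null) {
            throw new IllegalArgumentException("Keys must not be null");
        }
        this.secondaryKey = secondaryKey;
        this.recordKey = recordKey;
    }

    // Returned verbatim: any "$" in the key is ordinary content here.
    String secondaryKey() {
        return secondaryKey;
    }

    String recordKey() {
        return recordKey;
    }
}
```

With this shape, new SecondaryIndexEntry("0$1", "rk1").secondaryKey() returns "0$1" unchanged.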



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
