Davis Zhang created HUDI-9541:
---------------------------------
Summary: Secondary index bug
Key: HUDI-9541
URL: https://issues.apache.org/jira/browse/HUDI-9541
Project: Apache Hudi
Issue Type: Bug
Components: index
Reporter: Davis Zhang
# What's the issue
Here we assume the input is <sec key><separator><record key> and extract the
<sec key> part.
{code:java}
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  // the payload key is in the format of "secondaryKey$primaryKey"
  // we need to extract the secondary key from the payload key
  checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR),
      "Invalid key format for secondary index payload: " + key);
  int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
  return unescapeSpecialChars(key.substring(0, delimiterIndex));
}
{code}
The separator is "$".
This code incorrectly assumes that <sec key> never contains "$". For an input
such as
<sec key> = 0$1
the function returns "0" as the sec key, which is wrong.
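The failure can be reproduced with a minimal sketch. It assumes the separator position is simply the first occurrence of "$"; the class and method names are hypothetical stand-ins, not the actual Hudi code.

```java
final class NaiveSecondaryKeySplit {
    // Separator as described in this issue.
    static final String SEPARATOR = "$";

    // Hypothetical stand-in for the extraction logic: cut at the FIRST separator.
    static String naiveSecondaryKey(String indexKey) {
        int pos = indexKey.indexOf(SEPARATOR);
        if (pos < 0) {
            throw new IllegalStateException("Invalid key format: " + indexKey);
        }
        return indexKey.substring(0, pos);
    }

    public static void main(String[] args) {
        String secondaryKey = "0$1";
        String recordKey = "rk1"; // made-up record key for illustration
        String indexKey = secondaryKey + SEPARATOR + recordKey; // "0$1$rk1"
        // The secondary key is truncated at its own "$".
        System.out.println(naiveSecondaryKey(indexKey)); // prints "0"
    }
}
```

Any two secondary keys that share the prefix before their first "$" collapse to the same extracted value under this logic.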
# Impact
Data correctness and data corruption issue.
# Proposed fix
## Escaping the string
If we want to keep joining the two keys with a magic separator, we have to
escape every $ inside <sec key>.
Write path:
Start with sec key 0$1.
When writing the sec key, escape each $ as $$, so that the escaped sec key
never contains a standalone $.
It is written as
0$$1
After concatenation, the index key is
0$$1$<record key>
Later, to extract the sec key:
* scan from the left, treating each $$ pair as one escaped $; the first $ that is not part of such a pair is the separator.
* extract the prefix before the separator.
* unescape the extracted prefix by replacing every $$ with $.
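The write and read steps above can be sketched roughly as follows. All names are hypothetical and this is not the Hudi implementation. The scan consumes $$ pairs from the left, so the second $ of a pair is never mistaken for the separator. The sketch also assumes the record key does not start with $; with a doubled-character escape that case would remain ambiguous.

```java
final class DollarEscapingSketch {
    static final char SEP = '$';

    // Write path: double every '$' in the secondary key.
    static String escape(String secondaryKey) {
        return secondaryKey.replace("$", "$$"); // literal replace, not a regex
    }

    // Assumption: recordKey does not begin with '$'.
    static String buildIndexKey(String secondaryKey, String recordKey) {
        return escape(secondaryKey) + SEP + recordKey;
    }

    // Read path: skip "$$" pairs; the first unpaired '$' is the separator.
    static int separatorPosition(String indexKey) {
        int i = 0;
        while (i < indexKey.length()) {
            if (indexKey.charAt(i) == SEP) {
                boolean escapedPair =
                    i + 1 < indexKey.length() && indexKey.charAt(i + 1) == SEP;
                if (!escapedPair) {
                    return i;
                }
                i += 2; // consume the escaped '$'
            } else {
                i++;
            }
        }
        return -1;
    }

    static String extractSecondaryKey(String indexKey) {
        int pos = separatorPosition(indexKey);
        if (pos < 0) {
            throw new IllegalStateException("No separator found in: " + indexKey);
        }
        // Unescape the prefix: every "$$" becomes "$" again.
        return indexKey.substring(0, pos).replace("$$", "$");
    }

    public static void main(String[] args) {
        String indexKey = buildIndexKey("0$1", "rk1"); // "0$$1$rk1"
        System.out.println(extractSecondaryKey(indexKey)); // prints "0$1"
    }
}
```

Using a dedicated escape character that differs from the separator would be one way to remove the remaining ambiguity, at the cost of escaping that character as well.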
## Don't extract stuff from concatenated string
Either we store the two parts, <sec key> and <record key>, in separate logical
units, or all callers pass only the <sec key>, so the function never needs to
extract it from a concatenated string. This may require a larger code refactor.
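As a rough illustration of the first option, the two parts could travel together in a small value type instead of one concatenated string. The class below is purely hypothetical and does not correspond to any existing Hudi type.

```java
import java.util.Objects;

// Hypothetical holder that keeps both keys as separate fields,
// so no separator or escaping is needed to recover either one.
final class SecondaryIndexEntry {
    private final String secondaryKey;
    private final String recordKey;

    SecondaryIndexEntry(String secondaryKey, String recordKey) {
        this.secondaryKey = Objects.requireNonNull(secondaryKey, "secondaryKey");
        this.recordKey = Objects.requireNonNull(recordKey, "recordKey");
    }

    String getSecondaryKey() {
        return secondaryKey;
    }

    String getRecordKey() {
        return recordKey;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof SecondaryIndexEntry)) {
            return false;
        }
        SecondaryIndexEntry that = (SecondaryIndexEntry) other;
        return secondaryKey.equals(that.secondaryKey) && recordKey.equals(that.recordKey);
    }

    @Override
    public int hashCode() {
        return Objects.hash(secondaryKey, recordKey);
    }

    public static void main(String[] args) {
        SecondaryIndexEntry entry = new SecondaryIndexEntry("0$1", "rk1");
        // The secondary key comes back unchanged, "$" included.
        System.out.println(entry.getSecondaryKey()); // prints "0$1"
    }
}
```

With this shape, a "$" in either key is just data, which is why the refactor removes the bug class rather than patching it.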