[
https://issues.apache.org/jira/browse/HUDI-9541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Davis Zhang updated HUDI-9541:
------------------------------
Description:
h1. What's the issue
Here we assume the input is <sec key><separator><record key> and extracts the
<sec key> part.
{code:java}
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
// the payload key is in the format of "secondaryKey$primaryKey"
// we need to extract the secondary key from the payload key
checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key
format for secondary index payload: " + key);
int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
return unescapeSpecialChars(key.substring(0, delimiterIndex));
} {code}
The separator is "$".
This code piece incorrectly assumes <sec key> does not contain any "$" as its
content. Otherwise for input like
<sec key> = 0$1
the function returns the "0" as sec key which is apparently wrong.
h1. Impact
Data correctness and data corruption issue as this function can return garbage
data.
h1. Proposed fix
h2. Sol1: Escaping the string
If we want to stick to the trick of separating concatenated string with some
magic separator, we have to escape the $ of <sec Key> to something else. So
Write path:
start with sec key 0$1
when write sec key content, we escape $ to be $$. So that there can never be a
standalone $ after escaping the string.
Write it as
0$$1
When concatenating, it would be
0$$1$<record key>
later if we want to extract, then
* find the first $ which is not followed by another $.
* extract the prefix
* unescape the extracted part back - replacing all $$ to $
Read path follow similar patterns. Do escape before we concatenate and unescape
when we extract properly.
This is a data layout change
h2. Sol 2: Don't extract stuff from concatenated string
We either store the 2 parts <sec key> <record key> in separate logical units -
this is a data layout change
Or all callers only gives the <sec key> so the function never needs to extract
it from a concatenated string. This can imply a larger code refactor.
was:
# What's the issue
Here we assume the input is <sec key><separator><record key> and extracts the
<sec key> part.
{code:java}
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
// the payload key is in the format of "secondaryKey$primaryKey"
// we need to extract the secondary key from the payload key
checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key
format for secondary index payload: " + key);
int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
return unescapeSpecialChars(key.substring(0, delimiterIndex));
} {code}
The separator is "$".
This code piece incorrectly assumes <sec key> does not contain any "$" as its
content. Otherwise for input like
<sec key> = 0$1
the function returns the "0" as sec key which is apparently wrong.
# Impact
Data correctness and data corruption issue.
# Proposed fix
## Escaping the string
If we want to stick to the trick of separating concatenated string with some
magic separator, we have to escape the $ of <sec Key> to something else. So
Write path:
start with sec key 0$1
when write sec key content, we escape $ to be $$. So that there can never be a
standalone $ after escaping the string.
Write it as
0$$1
When concatenating, it would be
0$$1$<record key>
later if we want to extract, then
* find the first $ which is not followed by another $.
* extract the prefix
* unescape the extracted part back - replacing all $$ to $
## Don't extract stuff from concatenated string
We either store the 2 parts <sec key> <record key> in separate logical units
Or all callers only gives the <sec key> so the function never needs to extract
it from a concatenated string. This can imply a larger code refactor.
> Secondary index bug
> -------------------
>
> Key: HUDI-9541
> URL: https://issues.apache.org/jira/browse/HUDI-9541
> Project: Apache Hudi
> Issue Type: Bug
> Components: index
> Affects Versions: 1.1.0
> Reporter: Davis Zhang
> Priority: Critical
>
>
> h1. What's the issue
> Here we assume the input is <sec key><separator><record key> and extracts the
> <sec key> part.
>
> {code:java}
> public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
> // the payload key is in the format of "secondaryKey$primaryKey"
> // we need to extract the secondary key from the payload key
> checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key
> format for secondary index payload: " + key);
> int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
> return unescapeSpecialChars(key.substring(0, delimiterIndex));
> } {code}
> The separator is "$".
>
>
> This code piece incorrectly assumes <sec key> does not contain any "$" as its
> content. Otherwise for input like
> <sec key> = 0$1
> the function returns the "0" as sec key which is apparently wrong.
>
> h1. Impact
> Data correctness and data corruption issue as this function can return
> garbage data.
>
> h1. Proposed fix
> h2. Sol1: Escaping the string
> If we want to stick to the trick of separating concatenated string with some
> magic separator, we have to escape the $ of <sec Key> to something else. So
>
> Write path:
> start with sec key 0$1
> when write sec key content, we escape $ to be $$. So that there can never be
> a standalone $ after escaping the string.
> Write it as
> 0$$1
> When concatenating, it would be
> 0$$1$<record key>
> later if we want to extract, then
> * find the first $ which is not followed by another $.
> * extract the prefix
> * unescape the extracted part back - replacing all $$ to $
> Read path follow similar patterns. Do escape before we concatenate and
> unescape when we extract properly.
>
> This is a data layout change
> h2. Sol 2: Don't extract stuff from concatenated string
> We either store the 2 parts <sec key> <record key> in separate logical units
> - this is a data layout change
> Or all callers only gives the <sec key> so the function never needs to
> extract it from a concatenated string. This can imply a larger code refactor.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)