Davis Zhang created HUDI-9541:
---------------------------------

             Summary: Secondary index bug
                 Key: HUDI-9541
                 URL: https://issues.apache.org/jira/browse/HUDI-9541
             Project: Apache Hudi
          Issue Type: Bug
          Components: index
            Reporter: Davis Zhang


 

# What's the issue

The function below assumes its input has the form <sec key><separator><record key>
and extracts the <sec key> part.

 
{code:java}
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  // the payload key is in the format of "secondaryKey$primaryKey"
  // we need to extract the secondary key from the payload key
  checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR),
      "Invalid key format for secondary index payload: " + key);
  int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
  return unescapeSpecialChars(key.substring(0, delimiterIndex));
} {code}
The separator is "$".

 

 

This code incorrectly assumes that <sec key> never contains "$". When it does,
for example

<sec key> = 0$1

the function returns "0" as the sec key, which is wrong.
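
The failure can be reproduced with a minimal sketch. It assumes the separator position is simply the first occurrence of "$", which matches the behaviour described above. The class, the method, and the record key "rk1" are hypothetical stand-ins, not Hudi's actual code.

```java
// Hypothetical stand-in for the extraction logic quoted above; not Hudi's code.
final class NaiveSecondaryKeyExtraction {
    static final String SEPARATOR = "$";

    // Splits at the first "$" (assumed behaviour), so a "$" inside the
    // secondary key is mistaken for the separator.
    static String extractSecondaryKey(String key) {
        int delimiterIndex = key.indexOf(SEPARATOR);
        if (delimiterIndex < 0) {
            throw new IllegalStateException("Invalid key format: " + key);
        }
        return key.substring(0, delimiterIndex);
    }

    public static void main(String[] args) {
        String secondaryKey = "0$1";
        String recordKey = "rk1"; // hypothetical record key
        String indexKey = secondaryKey + SEPARATOR + recordKey; // "0$1$rk1"
        System.out.println(extractSecondaryKey(indexKey)); // prints "0", not "0$1"
    }
}
```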

 

# Impact

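This is a data correctness and data corruption issue: any secondary key that
contains "$" is truncated when it is extracted from the index key.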

 

# Proposed fix

## Escaping the string

If we want to keep splitting a concatenated string on a magic separator, we
have to escape every $ in <sec key>.

 

Write path:

Start with sec key 0$1.

When writing the sec key, escape each $ as $$, so that the escaped sec key
never contains a standalone $.

The escaped sec key is

0$$1

After concatenation, the index key is

0$$1$<record key>

Later, to extract the sec key:
 * scan from the left, treating each $$ as an escaped pair; the first $ that is not part of such a pair is the separator
 * extract the prefix before that separator
 * unescape the extracted prefix by replacing every $$ with $
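
The write and read paths above can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, and the scan consumes "$$" pairs from left to right, so that the second "$" of an escaped pair is never mistaken for the separator. It also assumes the record key does not begin with "$", since a leading "$" would merge with the separator and look like an escaped pair.

```java
// Hypothetical sketch of the "$" -> "$$" escaping scheme; not Hudi's code.
final class EscapedSecondaryIndexKey {
    static final char SEP = '$';

    // Write path: double every "$" in the secondary key.
    static String escape(String secondaryKey) {
        return secondaryKey.replace("$", "$$");
    }

    // Read path: collapse every "$$" back to "$".
    static String unescape(String escaped) {
        return escaped.replace("$$", "$");
    }

    // Assumes recordKey does not start with "$" (see the note above).
    static String construct(String secondaryKey, String recordKey) {
        return escape(secondaryKey) + SEP + recordKey;
    }

    // Returns the index of the first "$" that is not part of a "$$" pair, or -1.
    static int separatorPosition(String key) {
        int i = 0;
        while (i < key.length()) {
            if (key.charAt(i) == SEP) {
                if (i + 1 < key.length() && key.charAt(i + 1) == SEP) {
                    i += 2; // escaped pair, skip both characters
                    continue;
                }
                return i; // standalone "$" is the separator
            }
            i++;
        }
        return -1;
    }

    static String extractSecondaryKey(String key) {
        int pos = separatorPosition(key);
        if (pos < 0) {
            throw new IllegalArgumentException("No separator found in: " + key);
        }
        return unescape(key.substring(0, pos));
    }

    public static void main(String[] args) {
        String indexKey = construct("0$1", "rk1"); // hypothetical record key
        System.out.println(indexKey);                      // prints 0$$1$rk1
        System.out.println(extractSecondaryKey(indexKey)); // prints 0$1
    }
}
```

Checking only whether a "$" is followed by another "$" is not enough: in 0$$1$rk1 the second "$" of the pair is followed by "1", so the scan has to skip pairs as a unit.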

 

## Don't extract stuff from concatenated string

Either we store the two parts, <sec key> and <record key>, in separate logical
units,

or all callers pass only the <sec key>, so the function never needs to extract
it from a concatenated string. This may require a larger code refactor.
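
As a rough illustration of the first option, the two parts could be carried as separate fields, so no separator or escaping is involved. The class below is a hypothetical sketch, not an existing Hudi type, and it says nothing about how such an entry would be serialized in the index.

```java
import java.util.Objects;

// Hypothetical holder that keeps the two keys as separate fields.
final class SecondaryIndexEntry {
    private final String secondaryKey;
    private final String recordKey;

    SecondaryIndexEntry(String secondaryKey, String recordKey) {
        this.secondaryKey = Objects.requireNonNull(secondaryKey, "secondaryKey");
        this.recordKey = Objects.requireNonNull(recordKey, "recordKey");
    }

    // No parsing needed: the secondary key is returned exactly as stored.
    String getSecondaryKey() {
        return secondaryKey;
    }

    String getRecordKey() {
        return recordKey;
    }

    public static void main(String[] args) {
        SecondaryIndexEntry entry = new SecondaryIndexEntry("0$1", "rk1");
        System.out.println(entry.getSecondaryKey()); // prints 0$1
    }
}
```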



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
