[jira] [Updated] (HUDI-9541) Secondary index bug

Davis Zhang (Jira) Fri, 20 Jun 2025 11:34:05 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-9541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Davis Zhang updated HUDI-9541:
------------------------------
    Description: 
 
h1. What's the issue

Here we assume the input is <sec key><separator><record key> and extracts the 
<sec key> part.

 
{code:java}
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  // the payload key is in the format of "secondaryKey$primaryKey"
  // we need to extract the secondary key from the payload key
  checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key 
format for secondary index payload: " + key);
  int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
  return unescapeSpecialChars(key.substring(0, delimiterIndex));
} {code}
The separator is "$".

 

 

This code piece incorrectly assumes <sec key> does not contain any "$" as its 
content. Otherwise for input like

<sec key> = 0$1

the function returns the "0" as sec key which is apparently wrong.

 
h1. Impact

Data correctness and data corruption issue as this function can return garbage 
data.

 
h1. Proposed fix
h2. Sol1: Escaping the string

If we want to stick to the trick of separating concatenated string with some 
magic separator, we have to escape the $ of <sec Key> to something else. So

 

Write path:

start with sec key 0$1

when write sec key content, we escape $ to be $$. So that there can never be a 
standalone $ after escaping the string.

Write it as

0$$1

When concatenating, it would be

0$$1$<record key>

later if we want to extract, then
 * find the first $ which is not followed by another $.
 * extract the prefix
 * unescape the extracted part back - replacing all $$ to $

Read path follow similar patterns. Do escape before we concatenate and unescape 
when we extract properly.

 

This is a data layout change
h2. Sol 2: Don't extract stuff from concatenated string

We either store the 2 parts <sec key> <record key> in separate logical units - 
this is a data layout change

Or all callers only gives the <sec key> so the function never needs to extract 
it from a concatenated string. This can imply a larger code refactor.

  was:
 

# What's the issue

Here we assume the input is <sec key><separator><record key> and extracts the 
<sec key> part.

 
{code:java}
public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
  // the payload key is in the format of "secondaryKey$primaryKey"
  // we need to extract the secondary key from the payload key
  checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key 
format for secondary index payload: " + key);
  int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
  return unescapeSpecialChars(key.substring(0, delimiterIndex));
} {code}
The separator is "$".

 

 

This code piece incorrectly assumes <sec key> does not contain any "$" as its 
content. Otherwise for input like

<sec key> = 0$1

the function returns the "0" as sec key which is apparently wrong.

 

# Impact

Data correctness and data corruption issue.

 

# Proposed fix

## Escaping the string

If we want to stick to the trick of separating concatenated string with some 
magic separator, we have to escape the $ of <sec Key> to something else. So

 

Write path:

start with sec key 0$1

when write sec key content, we escape $ to be $$. So that there can never be a 
standalone $ after escaping the string.

Write it as

0$$1

When concatenating, it would be

0$$1$<record key>

later if we want to extract, then
 * find the first $ which is not followed by another $.
 * extract the prefix
 * unescape the extracted part back - replacing all $$ to $

 

## Don't extract stuff from concatenated string

We either store the 2 parts <sec key> <record key> in separate logical units

Or all callers only gives the <sec key> so the function never needs to extract 
it from a concatenated string. This can imply a larger code refactor.


> Secondary index bug
> -------------------
>
>                 Key: HUDI-9541
>                 URL: https://issues.apache.org/jira/browse/HUDI-9541
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>    Affects Versions: 1.1.0
>            Reporter: Davis Zhang
>            Priority: Critical
>
>  
> h1. What's the issue
> Here we assume the input is <sec key><separator><record key> and extracts the 
> <sec key> part.
>  
> {code:java}
> public static String getSecondaryKeyFromSecondaryIndexKey(String key) {
>   // the payload key is in the format of "secondaryKey$primaryKey"
>   // we need to extract the secondary key from the payload key
>   checkState(key.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR), "Invalid key 
> format for secondary index payload: " + key);
>   int delimiterIndex = getSecondaryIndexKeySeparatorPosition(key);
>   return unescapeSpecialChars(key.substring(0, delimiterIndex));
> } {code}
> The separator is "$".
>  
>  
> This code piece incorrectly assumes <sec key> does not contain any "$" as its 
> content. Otherwise for input like
> <sec key> = 0$1
> the function returns the "0" as sec key which is apparently wrong.
>  
> h1. Impact
> Data correctness and data corruption issue as this function can return 
> garbage data.
>  
> h1. Proposed fix
> h2. Sol1: Escaping the string
> If we want to stick to the trick of separating concatenated string with some 
> magic separator, we have to escape the $ of <sec Key> to something else. So
>  
> Write path:
> start with sec key 0$1
> when write sec key content, we escape $ to be $$. So that there can never be 
> a standalone $ after escaping the string.
> Write it as
> 0$$1
> When concatenating, it would be
> 0$$1$<record key>
> later if we want to extract, then
>  * find the first $ which is not followed by another $.
>  * extract the prefix
>  * unescape the extracted part back - replacing all $$ to $
> Read path follow similar patterns. Do escape before we concatenate and 
> unescape when we extract properly.
>  
> This is a data layout change
> h2. Sol 2: Don't extract stuff from concatenated string
> We either store the 2 parts <sec key> <record key> in separate logical units 
> - this is a data layout change
> Or all callers only gives the <sec key> so the function never needs to 
> extract it from a concatenated string. This can imply a larger code refactor.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HUDI-9541) Secondary index bug

Reply via email to