[ 
https://issues.apache.org/jira/browse/HUDI-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-9543.
-------------------------------------
    Resolution: Fixed

https://github.com/apache/hudi/commit/43c521310733eb7f332b86e201bfb1961a2e3149

> Secondary index readers and writers do not handle null char properly
> --------------------------------------------------------------------
>
>                 Key: HUDI-9543
>                 URL: https://issues.apache.org/jira/browse/HUDI-9543
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>    Affects Versions: 1.1.0
>            Reporter: Davis Zhang
>            Assignee: Davis Zhang
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Attempted fix and spot more issue
> https://github.com/Davis-Zhang-Onehouse/hudi-oss/pull/2
> I wrote test exposing all SI bugs when null values kicks in. Some code 
> involves MDT write path which require deep dive. It will take some unknown 
> amount of time to get clarity on how it can be fixed e2e.
>  
>  
> If secondary key column contains null, the SI does not track those records.
>  
> Plan: Escape some char and use the unescaped version as the note for null str
> Use Null character (ASCII 0) to represent null
> for normal string, escape Null character (ASCII 0) to '\' + Null character 
> (ASCII 0)
> Make sure reader and writer path works for this
>  
>  
> h1. Hash value computation
> As of today, the hash value computation is based on its unescaped value (raw 
> value). 
>  
> {code:java}
> public static int mapRecordKeyToFileGroupIndex(
>     String recordKey, int numFileGroups, String partitionName, 
> HoodieIndexVersion version) {
>   if (MetadataPartitionType.SECONDARY_INDEX.isPartitionType(partitionName)
>       && version.greaterThanOrEquals(HoodieIndexVersion.SECONDARY_INDEX_TWO)
>       && recordKey.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR)) {
>     return 
> mapRecordKeyToFileGroupIndex(SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(recordKey),
>  numFileGroups);
>   }
>   return mapRecordKeyToFileGroupIndex(recordKey, numFileGroups);
> }
> // change to configurable larger group
> private static int mapRecordKeyToFileGroupIndex(String recordKey, int 
> numFileGroups) {
>   int h = 0;
>   for (int i = 0; i < recordKey.length(); ++i) {
>     h = 31 * h + recordKey.charAt(i);
>   }
>   return Math.abs(Math.abs(h) % numFileGroups);
> } {code}
>  
> When calculating the hash value, if the key is null, we hit NPE. We need to 
> fix this.
>  
> h3. Solution 1 - hard coded value
> So it is proposed to give a hard coded hash value "0" for null. If it is 
> null, return 0.
>  
> Pros:
>  * does not come with the over head of escaping. Even after escape, it just 
> gives another different fixed coded hash value of string "."
>  * easy to change, scope is well under control.
>  
> h3. Solution 2 - escape first then hash
> null value is first escaped as ".", then calculate the hash value.
> Cons:
>  * comes with the overhead of escaping whenever we want to get the hash 
> value, for all strings.
>  * Does not give much difference as we still end up with some fixed hash 
> value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to