[
https://issues.apache.org/jira/browse/HUDI-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan closed HUDI-9543.
-------------------------------------
Resolution: Fixed
https://github.com/apache/hudi/commit/43c521310733eb7f332b86e201bfb1961a2e3149
> Secondary index readers and writers do not handle null char properly
> --------------------------------------------------------------------
>
> Key: HUDI-9543
> URL: https://issues.apache.org/jira/browse/HUDI-9543
> Project: Apache Hudi
> Issue Type: Bug
> Components: index
> Affects Versions: 1.1.0
> Reporter: Davis Zhang
> Assignee: Davis Zhang
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.1.0
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Attempted a fix and spotted more issues:
> https://github.com/Davis-Zhang-Onehouse/hudi-oss/pull/2
> I wrote tests exposing all SI bugs that appear when null values kick in. Some
> of the code involves the MDT write path, which requires a deep dive. It will
> take an unknown amount of time to get clarity on how it can be fixed e2e.
>
>
> If the secondary key column contains null, the SI does not track those records.
>
> Plan: reserve one character to mark a null string, and escape that character
> wherever it occurs in a real value.
> * Use the null character (ASCII 0) to represent null.
> * For a normal string, escape each null character (ASCII 0) as '\' + null
> character (ASCII 0).
> * Make sure the reader and writer paths both work with this encoding.
>
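The plan above can be sketched roughly as follows. This is only an illustration of the escaping rule as stated in the ticket, not the code from the fix commit; the class name, method names, and `NULL_SENTINEL` constant are hypothetical. It assumes the only escaping rule is the one listed (a backslash inserted before each NUL) and that a bare NUL stands for null.

```java
// Hypothetical sketch of the null-escaping plan; names are illustrative only.
class NullEscapeSketch {
    static final char NUL = '\0';
    // Assumption from the plan: a lone NUL character represents a null key.
    static final String NULL_SENTINEL = String.valueOf(NUL);

    // Writer side: null becomes the sentinel, NUL inside a real value is escaped.
    static String escape(String value) {
        if (value == null) {
            return NULL_SENTINEL;
        }
        StringBuilder sb = new StringBuilder(value.length() + 2);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == NUL) {
                sb.append('\\'); // '\' + NUL marks a literal NUL
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Reader side: the sentinel becomes null, '\' + NUL collapses back to NUL.
    static String unescape(String stored) {
        if (NULL_SENTINEL.equals(stored)) {
            return null;
        }
        StringBuilder sb = new StringBuilder(stored.length());
        for (int i = 0; i < stored.length(); i++) {
            char c = stored.charAt(i);
            boolean escapesNul = c == '\\'
                && i + 1 < stored.length()
                && stored.charAt(i + 1) == NUL;
            if (!escapesNul) {
                sb.append(c); // drop only the backslash that directly precedes a NUL
            }
        }
        return sb.toString();
    }
}
```

Because every NUL in an escaped value is preceded by a backslash, a stored value can never be mistaken for the bare-NUL sentinel, so the reader can tell a null key from a one-character string containing NUL.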
>
> h1. Hash value computation
> As of today, the hash value is computed from the key's unescaped (raw) value.
>
> {code:java}
> public static int mapRecordKeyToFileGroupIndex(
>     String recordKey, int numFileGroups, String partitionName,
>     HoodieIndexVersion version) {
>   if (MetadataPartitionType.SECONDARY_INDEX.isPartitionType(partitionName)
>       && version.greaterThanOrEquals(HoodieIndexVersion.SECONDARY_INDEX_TWO)
>       && recordKey.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR)) {
>     return mapRecordKeyToFileGroupIndex(
>         SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(recordKey),
>         numFileGroups);
>   }
>   return mapRecordKeyToFileGroupIndex(recordKey, numFileGroups);
> }
>
> // change to configurable larger group
> private static int mapRecordKeyToFileGroupIndex(String recordKey, int numFileGroups) {
>   int h = 0;
>   for (int i = 0; i < recordKey.length(); ++i) {
>     h = 31 * h + recordKey.charAt(i);
>   }
>   return Math.abs(Math.abs(h) % numFileGroups);
> }
> {code}
>
> When calculating the hash value, if the key is null, we hit an NPE. We need to
> fix this.
>
> h3. Solution 1 - hard coded value
> The proposal is to use a hard-coded hash value of 0 for null: if the key is
> null, return 0.
>
> Pros:
> * Does not come with the overhead of escaping. Even with escaping, null would
> just map to a different fixed hash value, that of the string ".".
> * Easy to change; the scope is well under control.
>
> h3. Solution 2 - escape first then hash
> The null value is first escaped as ".", then the hash value is calculated.
> Cons:
> * Comes with the overhead of escaping every string whenever we want its hash
> value.
> * Does not make much difference, as null still ends up with a fixed hash
> value.
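To make the comparison between the two solutions concrete, here is a minimal sketch under stated assumptions. The hashing loop is copied from the snippet in the description; the class name, the method names `solution1`/`solution2`/`escape`, and the `NULL_PLACEHOLDER` constant are hypothetical. The placeholder is set to the NUL character from the plan, while the Solution 2 text writes it as "."; whichever is used, null ends up with a fixed hash value.

```java
// Hypothetical sketch comparing the two null-handling options; not the merged fix.
class NullSafeHashSketch {
    // Assumption: the escaped stand-in for a null key (the plan uses ASCII 0).
    static final String NULL_PLACEHOLDER = "\0";

    // Same loop as the private mapRecordKeyToFileGroupIndex in the description.
    static int hashToFileGroup(String key, int numFileGroups) {
        int h = 0;
        for (int i = 0; i < key.length(); ++i) {
            h = 31 * h + key.charAt(i);
        }
        return Math.abs(Math.abs(h) % numFileGroups);
    }

    // Solution 1: short-circuit null to a hard-coded value, no escaping at all.
    static int solution1(String key, int numFileGroups) {
        return key == null ? 0 : hashToFileGroup(key, numFileGroups);
    }

    // Minimal escape: null becomes the placeholder, NUL in real values gets a '\' prefix.
    static String escape(String key) {
        return key == null ? NULL_PLACEHOLDER : key.replace("\0", "\\\0");
    }

    // Solution 2: every key, null or not, pays for escaping before hashing.
    static int solution2(String key, int numFileGroups) {
        return hashToFileGroup(escape(key), numFileGroups);
    }
}
```

In both variants a null key lands in one fixed file group; the difference is that `solution2` runs `escape` on every key, which is the overhead listed under Cons.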