Sagar Sumit created HUDI-8489:
---------------------------------

             Summary: Fix encoding of secondary index key
                 Key: HUDI-8489
                 URL: https://issues.apache.org/jira/browse/HUDI-8489
             Project: Apache Hudi
          Issue Type: Task
            Reporter: Sagar Sumit
             Fix For: 1.0.0


The secondary index key is a combination of secondaryKey and recordKey. There
are two ways to encode it with a delimiter ($):
 # Base64-encode each part: `Base64.encode(secondaryKey) + DELIMITER + 
Base64.encode(recordKey)`. The Base64 alphabet does not contain $, so the 
delimiter is unambiguous, and this gives us a neat, standard way to encode. 
It might not be very efficient for long strings, since Base64 output is 
about a third larger than its input.
 # Escape special characters: `escapeSpecialChars(secondaryKey) + DELIMITER + 
escapeSpecialChars(recordKey)`. The keys stay readable and their order is 
preserved. This is a custom scheme not used in other systems.
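
The two options above can be sketched roughly as follows. This is only an
illustration, not the Hudi implementation: the class name, the method names,
and the choice of backslash as the escape character are all assumptions made
for the example. It shows that both schemes round-trip keys that themselves
contain the delimiter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch; names and the '\' escape character are assumptions.
public class SecondaryIndexKeySketch {
    static final char DELIMITER = '$';
    static final char ESCAPE = '\\';

    // Option 1: Base64-encode each part, then join with the delimiter.
    static String encodeBase64(String secondaryKey, String recordKey) {
        Base64.Encoder enc = Base64.getEncoder();
        return enc.encodeToString(secondaryKey.getBytes(StandardCharsets.UTF_8))
            + DELIMITER
            + enc.encodeToString(recordKey.getBytes(StandardCharsets.UTF_8));
    }

    static String[] decodeBase64(String key) {
        // '$' is not in the Base64 alphabet, so the first '$' is the delimiter.
        int idx = key.indexOf(DELIMITER);
        Base64.Decoder dec = Base64.getDecoder();
        return new String[] {
            new String(dec.decode(key.substring(0, idx)), StandardCharsets.UTF_8),
            new String(dec.decode(key.substring(idx + 1)), StandardCharsets.UTF_8)
        };
    }

    // Option 2: prefix every delimiter or escape character with the escape character.
    static String escapeSpecialChars(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 4);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == DELIMITER || c == ESCAPE) {
                sb.append(ESCAPE);
            }
            sb.append(c);
        }
        return sb.toString();
    }

    static String encodeEscaped(String secondaryKey, String recordKey) {
        return escapeSpecialChars(secondaryKey) + DELIMITER + escapeSpecialChars(recordKey);
    }

    static String[] decodeEscaped(String key) {
        StringBuilder current = new StringBuilder();
        String first = null;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            if (c == ESCAPE && i + 1 < key.length()) {
                // Escaped character: take the next character literally.
                current.append(key.charAt(++i));
            } else if (c == DELIMITER && first == null) {
                // First unescaped delimiter separates the two parts.
                first = current.toString();
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        return new String[] { first, current.toString() };
    }
}
```

For example, under these assumptions `encodeEscaped("a$b", "k1")` yields the
readable string `a\$b$k1`, while the Base64 variant yields an opaque string
with exactly one `$` in it.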

I ran a benchmark comparing encoding/decoding time for the two schemes and did 
not find much difference: https://gist.github.com/codope/b1c73abed748d77c0b4db974d835f9da



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
