Re: [PR] [SPARK-48682][SQL][FOLLOW-UP] Changed initCap behaviour with UTF8_BINARY collation [spark]

via GitHub Sat, 31 Aug 2024 08:49:04 -0700


viktorluc-db commented on code in PR #47771:
URL: https://github.com/apache/spark/pull/47771#discussion_r1739762711



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##########
@@ -550,6 +549,152 @@ public static UTF8String toTitleCase(final UTF8String target, final int collatio
       BreakIterator.getWordInstance(locale)));
   }
 
+  /**
+   * This 'HashMap' is introduced as a performance speedup. Since title-casing a codepoint can
+   * result in more than a single codepoint, for correctness, we would use
+   * 'UCharacter.toTitleCase(String)' which returns a 'String'. If we use
+   * 'UCharacter.toTitleCase(int)' (the version of the same function which converts a single
+   * codepoint to its title-case codepoint), it would be faster than the previously mentioned
+   * version, but the problem here is that we don't handle when title-casing a codepoint yields more
+   * than 1 codepoint. Since there are only 48 codepoints that are mapped to more than 1 codepoint

Review Comment:
   I tested it locally, with n=1e8 get/indexOf queries against the
   HashMap/ArrayList per run. The HashMap was faster in every run: it
   consistently took around 1.2 seconds, while the ArrayList took around
   1.6 seconds.
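
   For anyone who wants to reproduce this kind of comparison, here is a
   rough sketch of such a micro-benchmark. It is not the code I ran: the
   class name, the helper methods, and the placeholder keys are all
   hypothetical, the keys are not the real special-cased codepoints, and
   the default n is smaller than 1e8. It uses naive System.nanoTime timing,
   so treat the numbers as indicative only; a harness such as JMH would be
   more trustworthy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LookupBenchmark {
  // 48 entries, matching the count of multi-codepoint mappings cited in the Javadoc.
  static final int ENTRIES = 48;

  // Placeholder keys (an assumption for illustration, NOT the real codepoints).
  static int[] makeKeys() {
    int[] keys = new int[ENTRIES];
    for (int i = 0; i < ENTRIES; i++) {
      keys[i] = 0x1F80 + i;
    }
    return keys;
  }

  // Maps each key to its position, so both structures return the same values.
  static Map<Integer, Integer> buildMap(int[] keys) {
    Map<Integer, Integer> map = new HashMap<>();
    for (int i = 0; i < keys.length; i++) {
      map.put(keys[i], i);
    }
    return map;
  }

  static List<Integer> buildList(int[] keys) {
    List<Integer> list = new ArrayList<>();
    for (int key : keys) {
      list.add(key);
    }
    return list;
  }

  // Performs n HashMap.get calls; the sum is returned so the JIT cannot drop the loop.
  static long sumHashMapLookups(Map<Integer, Integer> map, int[] keys, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
      sum += map.get(keys[i % keys.length]);
    }
    return sum;
  }

  // Performs n ArrayList.indexOf calls (a linear scan over at most 48 elements).
  static long sumArrayListLookups(List<Integer> list, int[] keys, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
      sum += list.indexOf(Integer.valueOf(keys[i % keys.length]));
    }
    return sum;
  }

  public static void main(String[] args) {
    int n = args.length > 0 ? Integer.parseInt(args[0]) : 10_000_000;
    int[] keys = makeKeys();
    Map<Integer, Integer> map = buildMap(keys);
    List<Integer> list = buildList(keys);

    // Warm-up pass so the timed runs are more likely to hit compiled code.
    sumHashMapLookups(map, keys, n / 10);
    sumArrayListLookups(list, keys, n / 10);

    long t0 = System.nanoTime();
    long mapSum = sumHashMapLookups(map, keys, n);
    long t1 = System.nanoTime();
    long listSum = sumArrayListLookups(list, keys, n);
    long t2 = System.nanoTime();

    System.out.println("HashMap:   " + (t1 - t0) / 1_000_000 + " ms (checksum " + mapSum + ")");
    System.out.println("ArrayList: " + (t2 - t1) / 1_000_000 + " ms (checksum " + listSum + ")");
  }
}
```

   Passing 100000000 as the first argument gives n=1e8. The two checksums
   should match, which confirms that both loops did the same lookups.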



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


