Re: [PR] [SPARK-47418][SQL] Add hand-crafted implementations for lowercase unicode-aware contains, startsWith and endsWith and optimize UTF8_BINARY_LCASE [spark]

via GitHub Fri, 19 Apr 2024 09:04:31 -0700


dbatomic commented on code in PR #46082:
URL: https://github.com/apache/spark/pull/46082#discussion_r1572603566



##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -359,10 +414,97 @@ public boolean startsWith(final UTF8String prefix) {
     return matchAt(prefix, 0);
   }
 
+  /**
+   * Checks whether `prefix` is a prefix of `this` in a lowercase unicode-aware manner
+   *
+   * This function is written in a way which avoids excessive allocations in case if we work with
+   * bare ASCII-character strings.
+   */
+  public boolean startsWithInLowerCase(final UTF8String prefix) {
+    // No way to match sizes of strings for early return, since single grapheme can be expanded
+    // into several independent ones in lowercase
+    if (prefix.numBytes == 0) {
+      return true;
+    }
+    if (numBytes == 0) {
+      return false;
+    }
+
+    for (var i = 0; i < prefix.numBytes; i++) {

Review Comment:
   Would it be possible to share the implementation between `startsWithInLowerCase` and `endsWithInLowerCase`?
   
   E.g., if you tracked two variables (the offset in `prefix` and the offset in `this`), both methods should be able to use the same code.
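
A minimal sketch of what that sharing might look like, under several assumptions: plain `java.lang.String` stands in for `UTF8String`, the class name `LowerCaseAffixSketch` and the helpers `matchesInLowerCase` and `asciiToLower` are hypothetical, and the non-ASCII fallback simply lowercases both strings with `Locale.ROOT` rather than reproducing whatever the PR does. One loop tracks an offset into each string and walks either forward or backward:

```java
import java.util.Locale;

/** Sketch only: java.lang.String stands in for Spark's UTF8String. */
final class LowerCaseAffixSketch {

  static boolean startsWithInLowerCase(String text, String prefix) {
    return matchesInLowerCase(text, prefix, false);
  }

  static boolean endsWithInLowerCase(String text, String suffix) {
    return matchesInLowerCase(text, suffix, true);
  }

  // Hypothetical shared helper: walks `text` and `affix` with two offsets,
  // forward from index 0 or backward from the last index.
  private static boolean matchesInLowerCase(String text, String affix, boolean fromEnd) {
    if (affix.isEmpty()) {
      return true;
    }
    if (text.isEmpty()) {
      return false;
    }
    int step = fromEnd ? -1 : 1;
    int textPos = fromEnd ? text.length() - 1 : 0;
    int affixPos = fromEnd ? affix.length() - 1 : 0;

    for (int n = 0; n < affix.length(); n++) {
      if (textPos < 0 || textPos >= text.length()) {
        // `text` was all ASCII and is used up, yet `affix` has characters left.
        return false;
      }
      char t = text.charAt(textPos);
      char a = affix.charAt(affixPos);
      if (t > 0x7F || a > 0x7F) {
        // Non-ASCII input: lowercasing may change the length, so fall back to
        // the allocating path on the full strings (an assumption of this sketch).
        String lowerText = text.toLowerCase(Locale.ROOT);
        String lowerAffix = affix.toLowerCase(Locale.ROOT);
        return fromEnd ? lowerText.endsWith(lowerAffix) : lowerText.startsWith(lowerAffix);
      }
      if (asciiToLower(t) != asciiToLower(a)) {
        return false;
      }
      textPos += step;
      affixPos += step;
    }
    return true;
  }

  // ASCII-only lowercasing, so the fast path allocates nothing.
  private static char asciiToLower(char c) {
    return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
  }
}
```

With this shape, `startsWithInLowerCase` and `endsWithInLowerCase` differ only in the direction flag they pass, so the ASCII fast path and the fallback live in one place. A byte-level version on `UTF8String` would need the same two offsets but would have to step over multi-byte code points itself.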



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.