Re: [PR] [SPARK-47418][SQL] Add hand-crafted implementations for lowercase unicode-aware contains, startsWith and endsWith and optimize UTF8_BINARY_LCASE [spark]

via GitHub Mon, 22 Apr 2024 07:21:24 -0700


cloud-fan commented on code in PR #46082:
URL: https://github.com/apache/spark/pull/46082#discussion_r1574845082



##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -359,10 +414,97 @@ public boolean startsWith(final UTF8String prefix) {
     return matchAt(prefix, 0);
   }
 
+  /**
+   * Checks whether `prefix` is a prefix of `this` in a lowercase unicode-aware manner
+   *
+   * This function is written in a way which avoids excessive allocations in case if we work with
+   * bare ASCII-character strings.
+   */
+  public boolean startsWithInLowerCase(final UTF8String prefix) {
+    // No way to match sizes of strings for early return, since single grapheme can be expanded
+    // into several independent ones in lowercase
+    if (prefix.numBytes == 0) {
+      return true;
+    }
+    if (numBytes == 0) {
+      return false;
+    }
+
+    for (var i = 0; i < prefix.numBytes; i++) {
+      final var lhsByte = getByte(i);
+      final var rhsByte = prefix.getByte(i);
+      if (lhsByte < 0 || rhsByte < 0) {

Review Comment:
   Just out of curiosity: would it be more efficient to do the ASCII check up
   front and return early to the slow path? Then we could make the fast-path
   loop tighter.
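
   To make the question concrete, here is a minimal sketch of the shape I have
   in mind. It is not the PR's code: it works on plain `byte[]` rather than
   `UTF8String`, and the names `AsciiPrefixSketch`, `isAscii`, `toLowerAscii`
   and `slowPath` are hypothetical. The slow path here simply lowercases via
   `java.lang.String`, which is an assumption for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch: ASCII pre-check first, then a branch-light fast loop.
class AsciiPrefixSketch {
  /** True if bytes[0, len) are all ASCII (non-negative as signed Java bytes). */
  static boolean isAscii(byte[] bytes, int len) {
    for (int i = 0; i < len; i++) {
      if (bytes[i] < 0) {
        return false;
      }
    }
    return true;
  }

  /** Lowercases a single ASCII byte; other ASCII bytes are returned unchanged. */
  static int toLowerAscii(byte b) {
    return (b >= 'A' && b <= 'Z') ? b + ('a' - 'A') : b;
  }

  static boolean startsWithInLowerCase(byte[] str, byte[] prefix) {
    if (prefix.length == 0) {
      return true;
    }
    if (str.length == 0) {
      return false;
    }
    int n = prefix.length;
    // Decide up front: anything that is not plain ASCII in the compared
    // region (or is too short to compare byte by byte) goes to the slow path.
    if (str.length < n || !isAscii(prefix, n) || !isAscii(str, n)) {
      return slowPath(str, prefix);
    }
    // Fast path: no per-iteration sign check is needed any more.
    for (int i = 0; i < n; i++) {
      if (toLowerAscii(str[i]) != toLowerAscii(prefix[i])) {
        return false;
      }
    }
    return true;
  }

  /** Assumed slow path: decode, lowercase with Locale.ROOT, compare as Strings. */
  static boolean slowPath(byte[] str, byte[] prefix) {
    String s = new String(str, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
    String p = new String(prefix, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
    return s.startsWith(p);
  }

  /** Convenience overload for trying the sketch with String inputs. */
  static boolean startsWithInLowerCase(String str, String prefix) {
    return startsWithInLowerCase(
        str.getBytes(StandardCharsets.UTF_8), prefix.getBytes(StandardCharsets.UTF_8));
  }
}
```

   The trade-off is that the pre-check scans the compared region an extra
   time, so whether this wins over the single interleaved loop would need a
   benchmark.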



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

