vladimirg-db commented on code in PR #46082:
URL: https://github.com/apache/spark/pull/46082#discussion_r1572665195


##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -359,10 +414,97 @@ public boolean startsWith(final UTF8String prefix) {
     return matchAt(prefix, 0);
   }
 
+  /**
+   * Checks whether `prefix` is a prefix of `this` in a lowercase unicode-aware manner
+   *
+   * This function is written in a way which avoids excessive allocations in case if we work with
+   * bare ASCII-character strings.
+   */
+  public boolean startsWithInLowerCase(final UTF8String prefix) {
+    // No way to match sizes of strings for early return, since single grapheme can be expanded
+    // into several independent ones in lowercase
+    if (prefix.numBytes == 0) {
+      return true;
+    }
+    if (numBytes == 0) {
+      return false;
+    }
+
+    for (var i = 0; i < prefix.numBytes; i++) {

Review Comment:
   I wanted to do something like this, but I think this simple implementation is better, for the following reasons:
   - We can't reuse something like `matchAt(...)` with an offset into `this`, because we don't know the exact offset without a full conversion to lowercase: the lowercase representation of a Unicode string can be longer (or shorter) than the original. That is why the loop runs forward to check the prefix and backward to check the suffix.
   - We could probably write a parameterized function that indexes with something like `startOffset + this.offset + direction * i`. That would work for the loop indexing, but I can't think of suitable math for the inner `if` off the top of my head. It is probably possible, but it would be much more involved, since that code differs significantly in how it indexes.
   - We could replace the loop body with a closure and pass it as a strategy (since that is the part that differs), but introducing a virtual call inside a loop in an optimization PR wouldn't be reasonable.
   - It reads trivially, which is good for the reader.
   - The amount of copy-paste here is not that large.
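
To make the first point concrete, here is a minimal, self-contained sketch. It is not the code from this PR: the class `LowerCasePrefixSketch` and all of its methods are hypothetical, it works on plain `byte[]` and `String` rather than `UTF8String`, and it assumes `Locale.ROOT` lowercasing. It shows that lowercasing can change the UTF-8 byte length in both directions, and how a forward loop can stay allocation-free on ASCII input and fall back to full lowercasing only when it sees a non-ASCII byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class LowerCasePrefixSketch {

  /** Number of bytes {@code s} occupies when encoded as UTF-8. */
  public static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  /** UTF-8 byte length of {@code s} after locale-independent lowercasing. */
  public static int lowerUtf8Length(String s) {
    return utf8Length(s.toLowerCase(Locale.ROOT));
  }

  /** Lowercases a single ASCII byte; every other byte is returned unchanged. */
  private static byte toLowerAscii(byte b) {
    return (b >= 'A' && b <= 'Z') ? (byte) (b + ('a' - 'A')) : b;
  }

  /** Allocating fallback: lowercase both sides fully, then compare. */
  private static boolean slowPath(byte[] target, byte[] prefix) {
    String t = new String(target, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
    String p = new String(prefix, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
    return t.startsWith(p);
  }

  /** Hypothetical stand-in for the method under review, over raw UTF-8 bytes. */
  public static boolean startsWithInLowerCase(byte[] target, byte[] prefix) {
    if (prefix.length == 0) return true;
    if (target.length == 0) return false;
    for (int i = 0; i < prefix.length; i++) {
      // A negative byte belongs to a multi-byte UTF-8 sequence, so byte
      // offsets may no longer line up after lowercasing: use the slow path.
      if (prefix[i] < 0) return slowPath(target, prefix);
      // Everything so far was ASCII and matched, but the target ran out.
      if (i >= target.length) return false;
      if (target[i] < 0) return slowPath(target, prefix);
      if (toLowerAscii(target[i]) != toLowerAscii(prefix[i])) return false;
    }
    return true;
  }

  /** Convenience overload for callers that hold Java strings. */
  public static boolean startsWithInLowerCase(String target, String prefix) {
    return startsWithInLowerCase(
        target.getBytes(StandardCharsets.UTF_8),
        prefix.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    // U+212A KELVIN SIGN shrinks when lowercased; U+0130 grows.
    System.out.println(utf8Length("\u212A") + " -> " + lowerUtf8Length("\u212A"));
    System.out.println(utf8Length("\u0130") + " -> " + lowerUtf8Length("\u0130"));
    System.out.println(startsWithInLowerCase("Hello World", "hELLO"));
  }
}
```

Because a single code point can shrink or grow when lowercased, a fixed byte offset into `this` cannot be computed up front, which is the reason a `matchAt(...)`-style comparison does not carry over directly. A suffix check would mirror the same loop from the end of both arrays.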



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

