cloud-fan commented on code in PR #46082:
URL: https://github.com/apache/spark/pull/46082#discussion_r1574845082
##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -359,10 +414,97 @@ public boolean startsWith(final UTF8String prefix) {
return matchAt(prefix, 0);
}
+ /**
+ * Checks whether `prefix` is a prefix of `this` in a lowercase unicode-aware manner
+ *
+ * This function is written in a way which avoids excessive allocations in case if we work with
+ * bare ASCII-character strings.
+ */
+ public boolean startsWithInLowerCase(final UTF8String prefix) {
+ // No way to match sizes of strings for early return, since single grapheme can be expanded
+ // into several independent ones in lowercase
+ if (prefix.numBytes == 0) {
+ return true;
+ }
+ if (numBytes == 0) {
+ return false;
+ }
+
+ for (var i = 0; i < prefix.numBytes; i++) {
+ final var lhsByte = getByte(i);
+ final var rhsByte = prefix.getByte(i);
+ if (lhsByte < 0 || rhsByte < 0) {
Review Comment:
Just out of curiosity: would it be more efficient to do the ASCII check up front
and fall back to the slow path early? Then we could make the fast-path loop tighter.
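
A rough sketch of the shape I have in mind, written against plain byte arrays rather than the real `UTF8String` internals. The class `AsciiPrefixSketch` and its helpers are hypothetical names for illustration only, and the `String`-based slow path is an assumption standing in for whatever the PR does for non-ASCII input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical stand-in for UTF8String: works on raw UTF-8 byte arrays.
final class AsciiPrefixSketch {

  // True if the first `len` bytes are all ASCII. Bytes of non-ASCII UTF-8
  // sequences have the high bit set, so they are negative as Java bytes.
  static boolean isAscii(byte[] bytes, int len) {
    for (int i = 0; i < len; i++) {
      if (bytes[i] < 0) {
        return false;
      }
    }
    return true;
  }

  // Lowercases a single ASCII byte; other ASCII bytes pass through unchanged.
  static int toLowerAscii(byte b) {
    return (b >= 'A' && b <= 'Z') ? b + ('a' - 'A') : b;
  }

  static boolean startsWithInLowerCase(byte[] target, byte[] prefix) {
    if (prefix.length == 0) {
      return true;
    }
    if (target.length == 0) {
      return false;
    }
    int n = prefix.length;
    // ASCII check done once, up front, instead of inside the compare loop.
    if (target.length >= n && isAscii(prefix, n) && isAscii(target, n)) {
      // Fast path: tight loop with no per-byte "is this ASCII?" branch.
      for (int i = 0; i < n; i++) {
        if (toLowerAscii(target[i]) != toLowerAscii(prefix[i])) {
          return false;
        }
      }
      return true;
    }
    // Slow path (assumed): allocate and compare unicode-aware lowercase forms.
    String t = new String(target, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
    String p = new String(prefix, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
    return t.startsWith(p);
  }

  // Convenience overload for trying the sketch with string literals.
  static boolean startsWithInLowerCase(String target, String prefix) {
    return startsWithInLowerCase(
        target.getBytes(StandardCharsets.UTF_8),
        prefix.getBytes(StandardCharsets.UTF_8));
  }
}
```

The trade-off is that the pre-scan reads the first `prefix.length` bytes of both inputs twice on the fast path, so whether this actually wins over the single interleaved loop would need a benchmark.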