Re: [PR] [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

via GitHub Thu, 16 May 2024 23:57:59 -0700


uros-db commented on code in PR #46511:
URL: https://github.com/apache/spark/pull/46511#discussion_r1604437739



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##########
@@ -34,6 +34,143 @@
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found
+   * when searching for a pattern string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix,
+   * with respect to the UTF8_BINARY_LCASE collation. The method assumes
+   * that the prefix is already lowercased prior to method call to avoid the
+   * overhead of calling .toLowerCase() multiple times on the same prefix 
string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return whether the target string starts with the specified prefix in 
UTF8_BINARY_LCASE
+   */
+  public static boolean lowercaseMatchFrom(
+      final UTF8String target,
+      final UTF8String lowercasePattern,
+      int startPos) {
+    return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that starts with
+   * the specified prefix, with respect to the UTF8_BINARY_LCASE collation.
+   * The method assumes that the prefix is already lowercased. The method only
+   * considers the part of target string that starts from the specified 
position.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the end position for searching (in the target string)
+   * @return length of the target substring that ends with the specified 
suffix in lowercase
+   */
+  public static int lowercaseMatchLengthFrom(
+      final UTF8String target,
+      final UTF8String lowercasePattern,
+      int startPos) {
+    assert startPos >= 0;
+    for (int len = 0; len <= target.numChars() - startPos; ++len) {
+      if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+        return len;
+      }
+    }

Review Comment:
   Yes, we'll address perf optimizations separately - so I'll create a ticket 
for that



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate) [spark]

Reply via email to