(spark) branch master updated: [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate)

wenchen Tue, 28 May 2024 10:05:31 -0700

This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 249390017ef4 [SPARK-48221][SQL] Alter string search logic for 
UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate)
249390017ef4 is described below

commit 249390017ef4a045037213dec386e16cca125080
Author: Uros Bojanic <[email protected]>
AuthorDate: Tue May 28 10:05:12 2024 -0700

    [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE 
collation (Contains, StartsWith, EndsWith, StringLocate)
    
    ### What changes were proposed in this pull request?
    String searching in UTF8_BINARY_LCASE now works on character-level, rather 
than on byte-level. For example: `contains("İ", "i");` now returns **false**, 
because there exists no `start, len` such that `lowercase(substring("İ", start, 
len)) == "i"`.
    
    ### Why are the changes needed?
    Fix functions that give unusable results due to one-to-many case mapping 
when performing string search under UTF8_BINARY_LCASE (see example above).
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, behaviour of `contains`, `startswith`, `endswith`, and 
`locate`/`position` expressions is changed for edge cases with one-to-many case 
mapping.
    
    ### How was this patch tested?
    New unit tests in `CollationSupportSuite`.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes #46511 from uros-db/alter-lcase-impl.
    
    Authored-by: Uros Bojanic <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
---
 .../catalyst/util/CollationAwareUTF8String.java    | 169 +++++++++++++++++++++
 .../spark/sql/catalyst/util/CollationSupport.java  |   8 +-
 .../org/apache/spark/unsafe/types/UTF8String.java  | 118 --------------
 .../spark/unsafe/types/CollationSupportSuite.java  | 129 +++++++++++++---
 .../apache/spark/unsafe/types/UTF8StringSuite.java | 105 -------------
 5 files changed, 278 insertions(+), 251 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
index ee0d611d7e65..0d0094d8d0a0 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -34,6 +34,155 @@ import java.util.Map;
  * Utility class for collation-aware UTF8String operations.
  */
 public class CollationAwareUTF8String {
+
+  /**
+   * The constant value to indicate that the match is not found when searching 
for a pattern
+   * string in a target string.
+   */
+  private static final int MATCH_NOT_FOUND = -1;
+
+  /**
+   * Returns whether the target string starts with the specified prefix, 
starting from the
+   * specified position (0-based index referring to character position in 
UTF8String), with respect
+   * to the UTF8_BINARY_LCASE collation. The method assumes that the prefix is 
already lowercased
+   * prior to method call to avoid the overhead of calling .toLowerCase() 
multiple times on the
+   * same prefix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return whether the target string starts with the specified prefix in 
UTF8_BINARY_LCASE
+   */
+  public static boolean lowercaseMatchFrom(
+      final UTF8String target,
+      final UTF8String lowercasePattern,
+      int startPos) {
+    return lowercaseMatchLengthFrom(target, lowercasePattern, startPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that starts with 
the specified
+   * prefix, starting from the specified position (0-based index referring to 
character position
+   * in UTF8String), with respect to the UTF8_BINARY_LCASE collation. The 
method assumes that the
+   * prefix is already lowercased. The method only considers the part of 
target string that
+   * starts from the specified (inclusive) position (that is, the method does 
not look at UTF8
+   * characters of the target string at or after position `endPos`). If the 
prefix is not found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return length of the target substring that starts with the specified 
prefix in lowercase
+   */
+  private static int lowercaseMatchLengthFrom(
+      final UTF8String target,
+      final UTF8String lowercasePattern,
+      int startPos) {
+    assert startPos >= 0;
+    for (int len = 0; len <= target.numChars() - startPos; ++len) {
+      if (target.substring(startPos, startPos + 
len).toLowerCase().equals(lowercasePattern)) {
+        return len;
+      }
+    }
+    return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the position of the first occurrence of the pattern string in the 
target string,
+   * starting from the specified position (0-based index referring to 
character position in
+   * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method 
assumes that the
+   * pattern string is already lowercased prior to call. If the pattern is not 
found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param startPos the start position for searching (in the target string)
+   * @return the position of the first occurrence of pattern in target
+   */
+  private static int lowercaseFind(
+      final UTF8String target,
+      final UTF8String lowercasePattern,
+      int startPos) {
+    assert startPos >= 0;
+    for (int i = startPos; i <= target.numChars(); ++i) {
+      if (lowercaseMatchFrom(target, lowercasePattern, i)) {
+        return i;
+      }
+    }
+    return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns whether the target string ends with the specified suffix, ending 
at the specified
+   * position (0-based index referring to character position in UTF8String), 
with respect to the
+   * UTF8_BINARY_LCASE collation. The method assumes that the suffix is 
already lowercased prior
+   * to method call to avoid the overhead of calling .toLowerCase() multiple 
times on the same
+   * suffix string.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param endPos the end position for searching (in the target string)
+   * @return whether the target string ends with the specified suffix in 
lowercase
+   */
+  public static boolean lowercaseMatchUntil(
+      final UTF8String target,
+      final UTF8String lowercasePattern,
+      int endPos) {
+    return lowercaseMatchLengthUntil(target, lowercasePattern, endPos) != 
MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the length of the substring of the target string that ends with 
the specified
+   * suffix, ending at the specified position (0-based index referring to 
character position in
+   * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method 
assumes that the
+   * suffix is already lowercased. The method only considers the part of 
target string that ends
+   * at the specified (non-inclusive) position (that is, the method does not 
look at UTF8
+   * characters of the target string at or after position `endPos`). If the 
suffix is not found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param endPos the end position for searching (in the target string)
+   * @return length of the target substring that ends with the specified 
suffix in lowercase
+   */
+  private static int lowercaseMatchLengthUntil(
+      final UTF8String target,
+      final UTF8String lowercasePattern,
+      int endPos) {
+    assert endPos <= target.numChars();
+    for (int len = 0; len <= endPos; ++len) {
+      if (target.substring(endPos - len, 
endPos).toLowerCase().equals(lowercasePattern)) {
+        return len;
+      }
+    }
+    return MATCH_NOT_FOUND;
+  }
+
+  /**
+   * Returns the position of the last occurrence of the pattern string in the 
target string,
+   * ending at the specified position (0-based index referring to character 
position in
+   * UTF8String), with respect to the UTF8_BINARY_LCASE collation. The method 
assumes that the
+   * pattern string is already lowercased prior to call. If the pattern is not 
found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param lowercasePattern the string to be searched for
+   * @param endPos the end position for searching (in the target string)
+   * @return the position of the last occurrence of pattern in target
+   */
+  private static int lowercaseRFind(
+      final UTF8String target,
+      final UTF8String lowercasePattern,
+      int endPos) {
+    assert endPos <= target.numChars();
+    for (int i = endPos; i >= 0; --i) {
+      if (lowercaseMatchUntil(target, lowercasePattern, i)) {
+        return i;
+      }
+    }
+    return MATCH_NOT_FOUND;
+  }
+
   public static UTF8String replace(final UTF8String src, final UTF8String 
search,
       final UTF8String replace, final int collationId) {
     // This collation aware implementation is based on existing implementation 
on UTF8String
@@ -183,6 +332,23 @@ public class CollationAwareUTF8String {
     return 0;
   }
 
+  /**
+   * Returns the position of the first occurrence of the pattern string in the 
target string,
+   * starting from the specified position (0-based index referring to 
character position in
+   * UTF8String), with respect to the UTF8_BINARY_LCASE collation. If the 
pattern is not found,
+   * MATCH_NOT_FOUND is returned.
+   *
+   * @param target the string to be searched in
+   * @param pattern the string to be searched for
+   * @param start the start position for searching (in the target string)
+   * @return the position of the first occurrence of pattern in target
+   */
+  public static int lowercaseIndexOf(final UTF8String target, final UTF8String 
pattern,
+      final int start) {
+    if (pattern.numChars() == 0) return 0;
+    return lowercaseFind(target, pattern.toLowerCase(), start);
+  }
+
   public static int indexOf(final UTF8String target, final UTF8String pattern,
       final int start, final int collationId) {
     if (pattern.numBytes() == 0) {
@@ -467,4 +633,7 @@ public class CollationAwareUTF8String {
     }
     return srcString.copyUTF8String(0, trimByteIdx);
   }
+
+  // TODO: Add more collation-aware UTF8String operations here.
+
 }
diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
index bea3dc08b448..8f7aed30464c 100644
--- 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
@@ -118,7 +118,7 @@ public final class CollationSupport {
       return l.contains(r);
     }
     public static boolean execLowercase(final UTF8String l, final UTF8String 
r) {
-      return l.containsInLowerCase(r);
+      return CollationAwareUTF8String.lowercaseIndexOf(l, r, 0) >= 0;
     }
     public static boolean execICU(final UTF8String l, final UTF8String r,
         final int collationId) {
@@ -156,7 +156,7 @@ public final class CollationSupport {
       return l.startsWith(r);
     }
     public static boolean execLowercase(final UTF8String l, final UTF8String 
r) {
-      return l.startsWithInLowerCase(r);
+      return CollationAwareUTF8String.lowercaseMatchFrom(l, r.toLowerCase(), 
0);
     }
     public static boolean execICU(final UTF8String l, final UTF8String r,
         final int collationId) {
@@ -193,7 +193,7 @@ public final class CollationSupport {
       return l.endsWith(r);
     }
     public static boolean execLowercase(final UTF8String l, final UTF8String 
r) {
-      return l.endsWithInLowerCase(r);
+      return CollationAwareUTF8String.lowercaseMatchUntil(l, r.toLowerCase(), 
l.numChars());
     }
     public static boolean execICU(final UTF8String l, final UTF8String r,
         final int collationId) {
@@ -430,7 +430,7 @@ public final class CollationSupport {
     }
     public static int execLowercase(final UTF8String string, final UTF8String 
substring,
         final int start) {
-      return string.toLowerCase().indexOf(substring.toLowerCase(), start);
+      return CollationAwareUTF8String.lowercaseIndexOf(string, substring, 
start);
     }
     public static int execICU(final UTF8String string, final UTF8String 
substring, final int start,
         final int collationId) {
diff --git 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
index 20b26b6ebc5a..03286e063528 100644
--- a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
+++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
@@ -341,44 +341,6 @@ public final class UTF8String implements 
Comparable<UTF8String>, Externalizable,
     return false;
   }
 
-  /**
-   * Returns whether `this` contains `substring` in a lowercase unicode-aware 
manner
-   *
-   * This function is written in a way which avoids excessive allocations in 
case if we work with
-   * bare ASCII-character strings.
-   */
-  public boolean containsInLowerCase(final UTF8String substring) {
-    if (substring.numBytes == 0) {
-      return true;
-    }
-
-    // Both `this` and the `substring` are checked for non-ASCII characters, 
otherwise we would
-    // have to use `startsWithLowerCase(...)` in a loop, and it would 
frequently allocate
-    // (e.g. in case of `containsInLowerCase("1大1大1大...", "11")`)
-    if (!substring.isFullAscii()) {
-      return toLowerCase().contains(substring.toLowerCaseSlow());
-    }
-    if (!isFullAscii()) {
-      return toLowerCaseSlow().contains(substring.toLowerCaseAscii());
-    }
-
-    if (numBytes < substring.numBytes) {
-      return false;
-    }
-
-    final var firstLower = Character.toLowerCase(substring.getByte(0));
-    for (var i = 0; i <= (numBytes - substring.numBytes); i++) {
-      if (Character.toLowerCase(getByte(i)) == firstLower) {
-        final var rest = UTF8String.fromAddress(base, offset + i, numBytes - 
i);
-        if (rest.matchAtInLowerCaseAscii(substring, 0)) {
-          return true;
-        }
-      }
-    }
-
-    return false;
-  }
-
   /**
    * Returns the byte at position `i`.
    */
@@ -393,94 +355,14 @@ public final class UTF8String implements 
Comparable<UTF8String>, Externalizable,
     return ByteArrayMethods.arrayEquals(base, offset + pos, s.base, s.offset, 
s.numBytes);
   }
 
-  private boolean matchAtInLowerCaseAscii(final UTF8String s, int pos) {
-    if (s.numBytes + pos > numBytes || pos < 0) {
-      return false;
-    }
-
-    for (var i = 0; i < s.numBytes; i++) {
-      if (Character.toLowerCase(getByte(pos + i)) != 
Character.toLowerCase(s.getByte(i))) {
-        return false;
-      }
-    }
-
-    return true;
-  }
-
   public boolean startsWith(final UTF8String prefix) {
     return matchAt(prefix, 0);
   }
 
-  /**
-   * Checks whether `prefix` is a prefix of `this` in a lowercase 
unicode-aware manner
-   *
-   * This function is written in a way which avoids excessive allocations in 
case if we work with
-   * bare ASCII-character strings.
-   */
-  public boolean startsWithInLowerCase(final UTF8String prefix) {
-    // No way to match sizes of strings for early return, since single 
grapheme can be expanded
-    // into several independent ones in lowercase
-    if (prefix.numBytes == 0) {
-      return true;
-    }
-    if (numBytes == 0) {
-      return false;
-    }
-
-    if (!prefix.isFullAscii()) {
-      return toLowerCase().startsWith(prefix.toLowerCaseSlow());
-    }
-
-    final var part = prefix.numBytes >= numBytes ? this : 
UTF8String.fromAddress(
-      base, offset, prefix.numBytes);
-    if (!part.isFullAscii()) {
-      return toLowerCaseSlow().startsWith(prefix.toLowerCaseAscii());
-    }
-
-    if (numBytes < prefix.numBytes) {
-      return false;
-    }
-
-    return matchAtInLowerCaseAscii(prefix, 0);
-  }
-
   public boolean endsWith(final UTF8String suffix) {
     return matchAt(suffix, numBytes - suffix.numBytes);
   }
 
-  /**
-   * Checks whether `suffix` is a suffix of `this` in a lowercase 
unicode-aware manner
-   *
-   * This function is written in a way which avoids excessive allocations in 
case if we work with
-   * bare ASCII-character strings.
-   */
-  public boolean endsWithInLowerCase(final UTF8String suffix) {
-    // No way to match sizes of strings for early return, since single 
grapheme can be expanded
-    // into several independent ones in lowercase
-    if (suffix.numBytes == 0) {
-      return true;
-    }
-    if (numBytes == 0) {
-      return false;
-    }
-
-    if (!suffix.isFullAscii()) {
-      return toLowerCase().endsWith(suffix.toLowerCaseSlow());
-    }
-
-    final var part = suffix.numBytes >= numBytes ? this : 
UTF8String.fromAddress(
-      base, offset + numBytes - suffix.numBytes, suffix.numBytes);
-    if (!part.isFullAscii()) {
-      return toLowerCaseSlow().endsWith(suffix.toLowerCaseAscii());
-    }
-
-    if (numBytes < suffix.numBytes) {
-      return false;
-    }
-
-    return matchAtInLowerCaseAscii(suffix, numBytes - suffix.numBytes);
-  }
-
   /**
    * Returns the upper case of this string
    */
diff --git 
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
 
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
index 7fc3c4e349c3..eb18d7665b09 100644
--- 
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
+++ 
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
@@ -23,7 +23,7 @@ import org.junit.jupiter.api.Test;
 
 import static org.junit.jupiter.api.Assertions.*;
 
-
+// checkstyle.off: AvoidEscapedUnicodeCharacters
 public class CollationSupportSuite {
 
   /**
@@ -101,14 +101,6 @@ public class CollationSupportSuite {
     assertContains("ab世De", "AB世dE", "UNICODE_CI", true);
     assertContains("äbćδe", "ÄbćδE", "UNICODE_CI", true);
     assertContains("äbćδe", "ÄBcΔÉ", "UNICODE_CI", false);
-    // Case-variable character length
-    assertContains("abİo12", "i̇o", "UNICODE_CI", true);
-    assertContains("abi̇o12", "İo", "UNICODE_CI", true);
-    assertContains("the İodine", "the i̇odine", "UTF8_BINARY_LCASE", true);
-    assertContains("the i̇odine", "the İodine", "UTF8_BINARY_LCASE", true);
-    assertContains("The İodiNe", " i̇oDin", "UTF8_BINARY_LCASE", true);
-    assertContains("İodiNe", "i̇oDine", "UTF8_BINARY_LCASE", true);
-    assertContains("İodiNe", " i̇oDin", "UTF8_BINARY_LCASE", false);
     // Characters with the same binary lowercase representation
     assertContains("The Kelvin.", "Kelvin", "UTF8_BINARY_LCASE", true);
     assertContains("The Kelvin.", "Kelvin", "UTF8_BINARY_LCASE", true);
@@ -116,6 +108,33 @@ public class CollationSupportSuite {
     assertContains("2 Kelvin.", "2 Kelvin", "UTF8_BINARY_LCASE", true);
     assertContains("2 Kelvin.", "2 Kelvin", "UTF8_BINARY_LCASE", true);
     assertContains("The KKelvin.", "KKelvin,", "UTF8_BINARY_LCASE", false);
+    // Case-variable character length
+    assertContains("i̇", "i", "UNICODE_CI", false);
+    assertContains("i̇", "\u0307", "UNICODE_CI", false);
+    assertContains("i̇", "İ", "UNICODE_CI", true);
+    assertContains("İ", "i", "UNICODE_CI", false);
+    assertContains("adi̇os", "io", "UNICODE_CI", false);
+    assertContains("adi̇os", "Io", "UNICODE_CI", false);
+    assertContains("adi̇os", "i̇o", "UNICODE_CI", true);
+    assertContains("adi̇os", "İo", "UNICODE_CI", true);
+    assertContains("adİos", "io", "UNICODE_CI", false);
+    assertContains("adİos", "Io", "UNICODE_CI", false);
+    assertContains("adİos", "i̇o", "UNICODE_CI", true);
+    assertContains("adİos", "İo", "UNICODE_CI", true);
+    assertContains("i̇", "i", "UTF8_BINARY_LCASE", true); // != UNICODE_CI
+    assertContains("İ", "\u0307", "UTF8_BINARY_LCASE", false);
+    assertContains("İ", "i", "UTF8_BINARY_LCASE", false);
+    assertContains("i̇", "\u0307", "UTF8_BINARY_LCASE", true); // != UNICODE_CI
+    assertContains("i̇", "İ", "UTF8_BINARY_LCASE", true);
+    assertContains("İ", "i", "UTF8_BINARY_LCASE", false);
+    assertContains("adi̇os", "io", "UTF8_BINARY_LCASE", false);
+    assertContains("adi̇os", "Io", "UTF8_BINARY_LCASE", false);
+    assertContains("adi̇os", "i̇o", "UTF8_BINARY_LCASE", true);
+    assertContains("adi̇os", "İo", "UTF8_BINARY_LCASE", true);
+    assertContains("adİos", "io", "UTF8_BINARY_LCASE", false);
+    assertContains("adİos", "Io", "UTF8_BINARY_LCASE", false);
+    assertContains("adİos", "i̇o", "UTF8_BINARY_LCASE", true);
+    assertContains("adİos", "İo", "UTF8_BINARY_LCASE", true);
   }
 
   private void assertStartsWith(
@@ -191,13 +210,6 @@ public class CollationSupportSuite {
     assertStartsWith("ab世De", "AB世dE", "UNICODE_CI", true);
     assertStartsWith("äbćδe", "ÄbćδE", "UNICODE_CI", true);
     assertStartsWith("äbćδe", "ÄBcΔÉ", "UNICODE_CI", false);
-    // Case-variable character length
-    assertStartsWith("İonic", "i̇o", "UNICODE_CI", true);
-    assertStartsWith("i̇onic", "İo", "UNICODE_CI", true);
-    assertStartsWith("the İodine", "the i̇odine", "UTF8_BINARY_LCASE", true);
-    assertStartsWith("the i̇odine", "the İodine", "UTF8_BINARY_LCASE", true);
-    assertStartsWith("İodiNe", "i̇oDin", "UTF8_BINARY_LCASE", true);
-    assertStartsWith("The İodiNe", "i̇oDin", "UTF8_BINARY_LCASE", false);
     // Characters with the same binary lowercase representation
     assertStartsWith("Kelvin.", "Kelvin", "UTF8_BINARY_LCASE", true);
     assertStartsWith("Kelvin.", "Kelvin", "UTF8_BINARY_LCASE", true);
@@ -205,6 +217,37 @@ public class CollationSupportSuite {
     assertStartsWith("2 Kelvin.", "2 Kelvin", "UTF8_BINARY_LCASE", true);
     assertStartsWith("2 Kelvin.", "2 Kelvin", "UTF8_BINARY_LCASE", true);
     assertStartsWith("KKelvin.", "KKelvin,", "UTF8_BINARY_LCASE", false);
+    // Case-variable character length
+    assertStartsWith("i̇", "i", "UNICODE_CI", false);
+    assertStartsWith("i̇", "İ", "UNICODE_CI", true);
+    assertStartsWith("İ", "i", "UNICODE_CI", false);
+    assertStartsWith("İİİ", "i̇i̇", "UNICODE_CI", true);
+    assertStartsWith("İİİ", "i̇i", "UNICODE_CI", false);
+    assertStartsWith("İi̇İ", "i̇İ", "UNICODE_CI", true);
+    assertStartsWith("i̇İi̇i̇", "İi̇İi", "UNICODE_CI", false);
+    assertStartsWith("i̇onic", "io", "UNICODE_CI", false);
+    assertStartsWith("i̇onic", "Io", "UNICODE_CI", false);
+    assertStartsWith("i̇onic", "i̇o", "UNICODE_CI", true);
+    assertStartsWith("i̇onic", "İo", "UNICODE_CI", true);
+    assertStartsWith("İonic", "io", "UNICODE_CI", false);
+    assertStartsWith("İonic", "Io", "UNICODE_CI", false);
+    assertStartsWith("İonic", "i̇o", "UNICODE_CI", true);
+    assertStartsWith("İonic", "İo", "UNICODE_CI", true);
+    assertStartsWith("i̇", "i", "UTF8_BINARY_LCASE", true); // != UNICODE_CI
+    assertStartsWith("i̇", "İ", "UTF8_BINARY_LCASE", true);
+    assertStartsWith("İ", "i", "UTF8_BINARY_LCASE", false);
+    assertStartsWith("İİİ", "i̇i̇", "UTF8_BINARY_LCASE", true);
+    assertStartsWith("İİİ", "i̇i", "UTF8_BINARY_LCASE", false);
+    assertStartsWith("İi̇İ", "i̇İ", "UTF8_BINARY_LCASE", true);
+    assertStartsWith("i̇İi̇i̇", "İi̇İi", "UTF8_BINARY_LCASE", true); // != 
UNICODE_CI
+    assertStartsWith("i̇onic", "io", "UTF8_BINARY_LCASE", false);
+    assertStartsWith("i̇onic", "Io", "UTF8_BINARY_LCASE", false);
+    assertStartsWith("i̇onic", "i̇o", "UTF8_BINARY_LCASE", true);
+    assertStartsWith("i̇onic", "İo", "UTF8_BINARY_LCASE", true);
+    assertStartsWith("İonic", "io", "UTF8_BINARY_LCASE", false);
+    assertStartsWith("İonic", "Io", "UTF8_BINARY_LCASE", false);
+    assertStartsWith("İonic", "i̇o", "UTF8_BINARY_LCASE", true);
+    assertStartsWith("İonic", "İo", "UTF8_BINARY_LCASE", true);
   }
 
   private void assertEndsWith(String pattern, String suffix, String 
collationName, boolean expected)
@@ -279,13 +322,6 @@ public class CollationSupportSuite {
     assertEndsWith("ab世De", "AB世dE", "UNICODE_CI", true);
     assertEndsWith("äbćδe", "ÄbćδE", "UNICODE_CI", true);
     assertEndsWith("äbćδe", "ÄBcΔÉ", "UNICODE_CI", false);
-    // Case-variable character length
-    assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);
-    assertEndsWith("The i̇o", "İo", "UNICODE_CI", true);
-    assertEndsWith("the İodine", "the i̇odine", "UTF8_BINARY_LCASE", true);
-    assertEndsWith("the i̇odine", "the İodine", "UTF8_BINARY_LCASE", true);
-    assertEndsWith("The İodiNe", "i̇oDine", "UTF8_BINARY_LCASE", true);
-    assertEndsWith("The İodiNe", "i̇oDin", "UTF8_BINARY_LCASE", false);
     // Characters with the same binary lowercase representation
     assertEndsWith("The Kelvin", "Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The Kelvin", "Kelvin", "UTF8_BINARY_LCASE", true);
@@ -293,6 +329,38 @@ public class CollationSupportSuite {
     assertEndsWith("The 2 Kelvin", "2 Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The 2 Kelvin", "2 Kelvin", "UTF8_BINARY_LCASE", true);
     assertEndsWith("The KKelvin", "KKelvin,", "UTF8_BINARY_LCASE", false);
+    // Case-variable character length
+    assertEndsWith("i̇", "\u0307", "UNICODE_CI", false);
+    assertEndsWith("i̇", "İ", "UNICODE_CI", true);
+    assertEndsWith("İ", "i", "UNICODE_CI", false);
+    assertEndsWith("İİİ", "i̇i̇", "UNICODE_CI", true);
+    assertEndsWith("İİİ", "ii̇", "UNICODE_CI", false);
+    assertEndsWith("İi̇İ", "İi̇", "UNICODE_CI", true);
+    assertEndsWith("i̇İi̇i̇", "\u0307İi̇İ", "UNICODE_CI", false);
+    assertEndsWith("the i̇o", "io", "UNICODE_CI", false);
+    assertEndsWith("the i̇o", "Io", "UNICODE_CI", false);
+    assertEndsWith("the i̇o", "i̇o", "UNICODE_CI", true);
+    assertEndsWith("the i̇o", "İo", "UNICODE_CI", true);
+    assertEndsWith("the İo", "io", "UNICODE_CI", false);
+    assertEndsWith("the İo", "Io", "UNICODE_CI", false);
+    assertEndsWith("the İo", "i̇o", "UNICODE_CI", true);
+    assertEndsWith("the İo", "İo", "UNICODE_CI", true);
+    assertEndsWith("i̇", "\u0307", "UTF8_BINARY_LCASE", true); // != UNICODE_CI
+    assertEndsWith("i̇", "İ", "UTF8_BINARY_LCASE", true);
+    assertEndsWith("İ", "\u0307", "UTF8_BINARY_LCASE", false);
+    assertEndsWith("İİİ", "i̇i̇", "UTF8_BINARY_LCASE", true);
+    assertEndsWith("İİİ", "ii̇", "UTF8_BINARY_LCASE", false);
+    assertEndsWith("İi̇İ", "İi̇", "UTF8_BINARY_LCASE", true);
+    assertEndsWith("i̇İi̇i̇", "\u0307İi̇İ", "UTF8_BINARY_LCASE", true); // != 
UNICODE_CI
+    assertEndsWith("i̇İi̇i̇", "\u0307İİ", "UTF8_BINARY_LCASE", false);
+    assertEndsWith("the i̇o", "io", "UTF8_BINARY_LCASE", false);
+    assertEndsWith("the i̇o", "Io", "UTF8_BINARY_LCASE", false);
+    assertEndsWith("the i̇o", "i̇o", "UTF8_BINARY_LCASE", true);
+    assertEndsWith("the i̇o", "İo", "UTF8_BINARY_LCASE", true);
+    assertEndsWith("the İo", "io", "UTF8_BINARY_LCASE", false);
+    assertEndsWith("the İo", "Io", "UTF8_BINARY_LCASE", false);
+    assertEndsWith("the İo", "i̇o", "UTF8_BINARY_LCASE", true);
+    assertEndsWith("the İo", "İo", "UTF8_BINARY_LCASE", true);
   }
 
   private void assertStringSplitSQL(String str, String delimiter, String 
collationName,
@@ -709,12 +777,24 @@ public class CollationSupportSuite {
     assertLocate("大千", "test大千世界大千世界", 9, "UNICODE_CI", 9);
     assertLocate("大千", "大千世界大千世界", 1, "UNICODE_CI", 1);
     // Case-variable character length
+    assertLocate("\u0307", "i̇", 1, "UTF8_BINARY", 2);
+    assertLocate("\u0307", "İ", 1, "UTF8_BINARY_LCASE", 0); // != UTF8_BINARY
+    assertLocate("i", "i̇", 1, "UNICODE_CI", 0);
+    assertLocate("\u0307", "i̇", 1, "UNICODE_CI", 0);
+    assertLocate("i̇", "i", 1, "UNICODE_CI", 0);
+    assertLocate("İ", "i̇", 1, "UNICODE_CI", 1);
+    assertLocate("İ", "i", 1, "UNICODE_CI", 0);
+    assertLocate("i", "i̇", 1, "UTF8_BINARY_LCASE", 1); // != UNICODE_CI
+    assertLocate("\u0307", "i̇", 1, "UTF8_BINARY_LCASE", 2); // != UNICODE_CI
+    assertLocate("i̇", "i", 1, "UTF8_BINARY_LCASE", 0);
+    assertLocate("İ", "i̇", 1, "UTF8_BINARY_LCASE", 1);
+    assertLocate("İ", "i", 1, "UTF8_BINARY_LCASE", 0);
     assertLocate("i̇o", "İo世界大千世界", 1, "UNICODE_CI", 1);
     assertLocate("i̇o", "大千İo世界大千世界", 1, "UNICODE_CI", 3);
     assertLocate("i̇o", "世界İo大千世界大千İo", 4, "UNICODE_CI", 11);
     assertLocate("İo", "i̇o世界大千世界", 1, "UNICODE_CI", 1);
     assertLocate("İo", "大千i̇o世界大千世界", 1, "UNICODE_CI", 3);
-    assertLocate("İo", "世界i̇o大千世界大千i̇o", 4, "UNICODE_CI", 12); // 12 instead 
of 11
+    assertLocate("İo", "世界i̇o大千世界大千i̇o", 4, "UNICODE_CI", 12);
   }
 
   private void assertSubstringIndex(String string, String delimiter, Integer 
count,
@@ -1008,3 +1088,4 @@ public class CollationSupportSuite {
   // TODO: Test other collation-aware expressions.
 
 }
+// checkstyle.on: AvoidEscapedUnicodeCharacters
diff --git 
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
 
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
index a1aba86cfbc5..0188297fd05a 100644
--- 
a/common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
+++ 
b/common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
@@ -215,43 +215,6 @@ public class UTF8StringSuite {
     assertFalse(fromString("大千世界").contains(fromString("大千世界好")));
   }
 
-  @Test
-  public void containsInLowerCase() {
-    // Corner cases
-    assertTrue(EMPTY_UTF8.containsInLowerCase(EMPTY_UTF8));
-    assertTrue(fromString("a").containsInLowerCase(EMPTY_UTF8));
-    assertTrue(fromString("A").containsInLowerCase(fromString("a")));
-    assertTrue(fromString("a").containsInLowerCase(fromString("A")));
-    assertFalse(EMPTY_UTF8.containsInLowerCase(fromString("a")));
-    // ASCII
-    assertTrue(fromString("hello").containsInLowerCase(fromString("ello")));
-    assertFalse(fromString("hello").containsInLowerCase(fromString("vello")));
-    
assertFalse(fromString("hello").containsInLowerCase(fromString("hellooo")));
-    // Unicode
-    assertTrue(fromString("大千世界").containsInLowerCase(fromString("千世界")));
-    assertFalse(fromString("大千世界").containsInLowerCase(fromString("世千")));
-    assertFalse(fromString("大千世界").containsInLowerCase(fromString("大千世界好")));
-    // ASCII lowercase
-    assertTrue(fromString("HeLlO").containsInLowerCase(fromString("ElL")));
-    assertFalse(fromString("HeLlO").containsInLowerCase(fromString("ElLoO")));
-    // Unicode lowercase
-    assertTrue(fromString("ЯбЛоКо").containsInLowerCase(fromString("БлОк")));
-    assertFalse(fromString("ЯбЛоКо").containsInLowerCase(fromString("лОкБ")));
-    // Characters with the same binary lowercase representation
-    assertTrue(fromString("The 
Kelvin.").containsInLowerCase(fromString("Kelvin")));
-    assertTrue(fromString("The 
Kelvin.").containsInLowerCase(fromString("Kelvin")));
-    assertTrue(fromString("The 
KKelvin.").containsInLowerCase(fromString("KKelvin")));
-    assertTrue(fromString("2 Kelvin.").containsInLowerCase(fromString("2 
Kelvin")));
-    assertTrue(fromString("2 Kelvin.").containsInLowerCase(fromString("2 
Kelvin")));
-    assertFalse(fromString("The 
KKelvin.").containsInLowerCase(fromString("KKelvin,")));
-    // Characters with longer binary lowercase representation
-    assertTrue(fromString("the İodine").containsInLowerCase(fromString("the 
i̇odine")));
-    assertTrue(fromString("the i̇odine").containsInLowerCase(fromString("the 
İodine")));
-    assertTrue(fromString("The İodiNe").containsInLowerCase(fromString(" 
i̇oDin")));
-    assertTrue(fromString("İodiNe").containsInLowerCase(fromString("i̇oDin")));
-    assertFalse(fromString("İodiNe").containsInLowerCase(fromString(" 
i̇oDin")));
-  }
-
   @Test
   public void startsWith() {
     assertTrue(EMPTY_UTF8.startsWith(EMPTY_UTF8));
@@ -263,40 +226,6 @@ public class UTF8StringSuite {
     assertFalse(fromString("大千世界").startsWith(fromString("大千世界好")));
   }
 
-  @Test
-  public void startsWithInLowerCase() {
-    // Corner cases
-    assertTrue(EMPTY_UTF8.startsWithInLowerCase(EMPTY_UTF8));
-    assertTrue(fromString("a").startsWithInLowerCase(EMPTY_UTF8));
-    assertTrue(fromString("A").startsWithInLowerCase(fromString("a")));
-    assertTrue(fromString("a").startsWithInLowerCase(fromString("A")));
-    assertFalse(EMPTY_UTF8.startsWithInLowerCase(fromString("a")));
-    // ASCII
-    assertTrue(fromString("hello").startsWithInLowerCase(fromString("hell")));
-    assertFalse(fromString("hello").startsWithInLowerCase(fromString("ell")));
-    // Unicode
-    assertTrue(fromString("大千世界").startsWithInLowerCase(fromString("大千")));
-    assertFalse(fromString("大千世界").startsWithInLowerCase(fromString("世千")));
-    // ASCII lowercase
-    assertTrue(fromString("HeLlO").startsWithInLowerCase(fromString("hElL")));
-    assertFalse(fromString("HeLlO").startsWithInLowerCase(fromString("ElL")));
-    // Unicode lowercase
-    
assertTrue(fromString("ЯбЛоКо").startsWithInLowerCase(fromString("яБлОк")));
-    
assertFalse(fromString("ЯбЛоКо").startsWithInLowerCase(fromString("БлОк")));
-    // Characters with the same binary lowercase representation
-    
assertTrue(fromString("Kelvin.").startsWithInLowerCase(fromString("Kelvin")));
-    
assertTrue(fromString("Kelvin.").startsWithInLowerCase(fromString("Kelvin")));
-    
assertTrue(fromString("KKelvin.").startsWithInLowerCase(fromString("KKelvin")));
-    assertTrue(fromString("2 Kelvin.").startsWithInLowerCase(fromString("2 
Kelvin")));
-    assertTrue(fromString("2 Kelvin.").startsWithInLowerCase(fromString("2 
Kelvin")));
-    
assertFalse(fromString("KKelvin.").startsWithInLowerCase(fromString("KKelvin,")));
-    // Characters with longer binary lowercase representation
-    assertTrue(fromString("the İodine").startsWithInLowerCase(fromString("the 
i̇odine")));
-    assertTrue(fromString("the i̇odine").startsWithInLowerCase(fromString("the 
İodine")));
-    
assertTrue(fromString("İodiNe").startsWithInLowerCase(fromString("i̇oDin")));
-    assertFalse(fromString("The 
İodiNe").startsWithInLowerCase(fromString("i̇oDin")));
-  }
-
   @Test
   public void endsWith() {
     assertTrue(EMPTY_UTF8.endsWith(EMPTY_UTF8));
@@ -308,40 +237,6 @@ public class UTF8StringSuite {
     assertFalse(fromString("数据砖头").endsWith(fromString("我的数据砖头")));
   }
 
-  @Test
-  public void endsWithInLowerCase() {
-    // Corner cases
-    assertTrue(EMPTY_UTF8.endsWithInLowerCase(EMPTY_UTF8));
-    assertTrue(fromString("a").endsWithInLowerCase(EMPTY_UTF8));
-    assertTrue(fromString("A").endsWithInLowerCase(fromString("a")));
-    assertTrue(fromString("a").endsWithInLowerCase(fromString("A")));
-    assertFalse(EMPTY_UTF8.endsWithInLowerCase(fromString("a")));
-    // ASCII
-    assertTrue(fromString("hello").endsWithInLowerCase(fromString("ello")));
-    assertFalse(fromString("hello").endsWithInLowerCase(fromString("hell")));
-    // Unicode
-    assertTrue(fromString("大千世界").endsWithInLowerCase(fromString("世界")));
-    assertFalse(fromString("大千世界").endsWithInLowerCase(fromString("大千")));
-    // ASCII lowercase
-    assertTrue(fromString("HeLlO").endsWithInLowerCase(fromString("ElLo")));
-    assertFalse(fromString("HeLlO").endsWithInLowerCase(fromString("hElL")));
-    // Unicode lowercase
-    assertTrue(fromString("ЯбЛоКо").endsWithInLowerCase(fromString("БлОкО")));
-    assertFalse(fromString("ЯбЛоКо").endsWithInLowerCase(fromString("яБлОк")));
-    // Characters with the same binary lowercase representation
-    assertTrue(fromString("The 
Kelvin").endsWithInLowerCase(fromString("Kelvin")));
-    assertTrue(fromString("The 
Kelvin").endsWithInLowerCase(fromString("Kelvin")));
-    assertTrue(fromString("The 
KKelvin").endsWithInLowerCase(fromString("KKelvin")));
-    assertTrue(fromString("The 2 Kelvin").endsWithInLowerCase(fromString("2 
Kelvin")));
-    assertTrue(fromString("The 2 Kelvin").endsWithInLowerCase(fromString("2 
Kelvin")));
-    assertFalse(fromString("The 
KKelvin").endsWithInLowerCase(fromString("KKelvin,")));
-    // Characters with longer binary lowercase representation
-    assertTrue(fromString("the İodine").endsWithInLowerCase(fromString("the 
i̇odine")));
-    assertTrue(fromString("the i̇odine").endsWithInLowerCase(fromString("the 
İodine")));
-    assertTrue(fromString("The 
İodiNe").endsWithInLowerCase(fromString("i̇oDine")));
-    assertFalse(fromString("The 
İodiNe").endsWithInLowerCase(fromString("i̇oDin")));
-  }
-
   @Test
   public void substring() {
     assertEquals(EMPTY_UTF8, fromString("hello").substring(0, 0));


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(spark) branch master updated: [SPARK-48221][SQL] Alter string search logic for UTF8_BINARY_LCASE collation (Contains, StartsWith, EndsWith, StringLocate)

Reply via email to