Re: [PR] [SPARK-48441][SQL] Fix StringTrim behaviour for non-UTF8_BINARY collations [spark]

via GitHub Sun, 14 Jul 2024 04:41:34 -0700


uros-db commented on code in PR #46762:
URL: https://github.com/apache/spark/pull/46762#discussion_r1677116591



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##########
@@ -841,117 +842,255 @@ public static UTF8String translate(final UTF8String input,
     return UTF8String.fromString(sb.toString());
   }
 
+  /**
+   * Trims the `srcString` string from both ends of the string using the specified `trimString`
+   * characters, with respect to the UTF8_LCASE collation. String trimming is performed by
+   * first trimming the left side of the string, and then trimming the right side of the string.
+   * The method returns the trimmed string. If the `trimString` is null, the method returns null.
+   *
+   * @param srcString the input string to be trimmed from both ends of the string
+   * @param trimString the trim string characters to trim
+   * @return the trimmed string (for UTF8_LCASE collation)
+   */
   public static UTF8String lowercaseTrim(
       final UTF8String srcString,
       final UTF8String trimString) {
-    // Matching UTF8String behavior for null `trimString`.
-    if (trimString == null) {
-      return null;
-    }
+    return lowercaseTrimRight(lowercaseTrimLeft(srcString, trimString), trimString);
+  }
 
-    UTF8String leftTrimmed = lowercaseTrimLeft(srcString, trimString);
-    return lowercaseTrimRight(leftTrimmed, trimString);
+  /**
+   * Trims the `srcString` string from both ends of the string using the specified `trimString`
+   * characters, with respect to all ICU collations in Spark. String trimming is performed by
+   * first trimming the left side of the string, and then trimming the right side of the string.
+   * The method returns the trimmed string. If the `trimString` is null, the method returns null.
+   *
+   * @param srcString the input string to be trimmed from both ends of the string
+   * @param trimString the trim string characters to trim

Review Comment:
   added
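
For readers following the thread: the hunk above drops the explicit null check from `lowercaseTrim` and instead composes the left and right trims. A minimal sketch of why that can still return null for a null `trimString` is below. It assumes plain `java.lang.String` and binary (case-sensitive) matching, and the class and method names are hypothetical stand-ins, not the Spark methods.

```java
// Hypothetical stand-in for the composition pattern in the diff; not Spark code.
class TrimCompositionSketch {
  // Returns null for a null trim string, otherwise drops leading chars found in `trim`.
  static String trimLeft(String src, String trim) {
    if (trim == null) return null;
    int start = 0;
    while (start < src.length() && trim.indexOf(src.charAt(start)) >= 0) start++;
    return src.substring(start);
  }

  // The null check runs before `src` is touched, so a null `src` coming from
  // trimLeft (which only happens when `trim` is null) never gets dereferenced.
  static String trimRight(String src, String trim) {
    if (trim == null) return null;
    int end = src.length();
    while (end > 0 && trim.indexOf(src.charAt(end - 1)) >= 0) end--;
    return src.substring(0, end);
  }

  // Trim both ends by composing the two helpers, left side first.
  static String trim(String src, String trim) {
    return trimRight(trimLeft(src, trim), trim);
  }
}
```

Under these assumptions, `trim("xxabcxx", "x")` yields `abc`, and a null `trimString` propagates through both helpers as null without a separate check in `trim`.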



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##########
@@ -841,117 +842,255 @@ public static UTF8String translate(final UTF8String input,
     return UTF8String.fromString(sb.toString());
   }
 
+  /**
+   * Trims the `srcString` string from both ends of the string using the specified `trimString`
+   * characters, with respect to the UTF8_LCASE collation. String trimming is performed by
+   * first trimming the left side of the string, and then trimming the right side of the string.
+   * The method returns the trimmed string. If the `trimString` is null, the method returns null.
+   *
+   * @param srcString the input string to be trimmed from both ends of the string
+   * @param trimString the trim string characters to trim
+   * @return the trimmed string (for UTF8_LCASE collation)
+   */
   public static UTF8String lowercaseTrim(
       final UTF8String srcString,
       final UTF8String trimString) {
-    // Matching UTF8String behavior for null `trimString`.
-    if (trimString == null) {
-      return null;
-    }
+    return lowercaseTrimRight(lowercaseTrimLeft(srcString, trimString), trimString);
+  }
 
-    UTF8String leftTrimmed = lowercaseTrimLeft(srcString, trimString);
-    return lowercaseTrimRight(leftTrimmed, trimString);
+  /**
+   * Trims the `srcString` string from both ends of the string using the specified `trimString`
+   * characters, with respect to all ICU collations in Spark. String trimming is performed by
+   * first trimming the left side of the string, and then trimming the right side of the string.
+   * The method returns the trimmed string. If the `trimString` is null, the method returns null.
+   *
+   * @param srcString the input string to be trimmed from both ends of the string
+   * @param trimString the trim string characters to trim
+   * @return the trimmed string (for ICU collations)
+   */
+  public static UTF8String trim(
+      final UTF8String srcString,
+      final UTF8String trimString,
+      final int collationId) {
+    return trimRight(trimLeft(srcString, trimString, collationId), trimString, collationId);
   }
 
+  /**
+   * Trims the `srcString` string from the left side using the specified `trimString` characters,
+   * with respect to the UTF8_LCASE collation. For UTF8_LCASE, the method first creates a hash
+   * set of lowercased code points in `trimString`, and then iterates over the `srcString` from
+   * the left side, until reaching a character whose lowercased code point is not in the hash set.
+   * Finally, the method returns the substring from that position to the end of `srcString`.
+   * If `trimString` is null, null is returned. If `trimString` is empty, `srcString` is returned.
+   *
+   * @param srcString the input string to be trimmed from the left end of the string
+   * @param trimString the trim string characters to trim
+   * @return the trimmed string (for UTF8_LCASE collation)
+   */
   public static UTF8String lowercaseTrimLeft(
       final UTF8String srcString,
       final UTF8String trimString) {
-    // Matching UTF8String behavior for null `trimString`.
+    // Matching the default UTF8String behavior for null `trimString`.
     if (trimString == null) {
       return null;
     }
 
-    // The searching byte position in the srcString.
-    int searchIdx = 0;
-    // The byte position of a first non-matching character in the srcString.
-    int trimByteIdx = 0;
-    // Number of bytes in srcString.
-    int numBytes = srcString.numBytes();
-    // Convert trimString to lowercase, so it can be searched properly.
-    UTF8String lowercaseTrimString = trimString.toLowerCase();
-
-    while (searchIdx < numBytes) {
-      UTF8String searchChar = srcString.copyUTF8String(
-        searchIdx,
-        searchIdx + UTF8String.numBytesForFirstByte(srcString.getByte(searchIdx)) - 1);
-      int searchCharBytes = searchChar.numBytes();
-
-      // Try to find the matching for the searchChar in the trimString.
-      if (lowercaseTrimString.find(searchChar.toLowerCase(), 0) >= 0) {
-        trimByteIdx += searchCharBytes;
-        searchIdx += searchCharBytes;
-      } else {
-        // No matching, exit the search.
-        break;
+    // Create a hash set of lowercased code points for all characters of `trimString`.
+    HashSet<Integer> trimChars = new HashSet<>();
+    Iterator<Integer> trimIter = trimString.codePointIterator();
+    while (trimIter.hasNext()) trimChars.add(getLowercaseCodePoint(trimIter.next()));
+
+    // Iterate over `srcString` from the left to find the first character that is not in the set.
+    int searchIndex = 0, codePoint;
+    Iterator<Integer> srcIter = srcString.codePointIterator();
+    while (srcIter.hasNext()) {
+      codePoint = getLowercaseCodePoint(srcIter.next());
+      // Special handling for Turkish dotted uppercase letter I.
+      if (codePoint == CODE_POINT_LOWERCASE_I && srcIter.hasNext() &&
+          trimChars.contains(CODE_POINT_COMBINED_LOWERCASE_I_DOT)) {
+        int nextCodePoint = getLowercaseCodePoint(srcIter.next());
+        if ((trimChars.contains(codePoint) && trimChars.contains(nextCodePoint))
+          || nextCodePoint == CODE_POINT_COMBINING_DOT) searchIndex += 2;
+        else {

Review Comment:
   fixed
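
Background on the "Turkish dotted uppercase letter I" branch in the hunk above, as a small standalone sketch. It uses only the JDK, not the Spark helpers; the class and method names are hypothetical, and how the diff's `CODE_POINT_*` constants relate to these values is an assumption on my part. The sketch shows that the full lowercase mapping of U+0130 expands to two code points, while the simple per-code-point mapping yields a single one, which is presumably why a lowercase `i` followed by a combining dot needs to be looked at as a pair.

```java
import java.util.Locale;

// Hypothetical illustration using only JDK calls; not Spark code.
class DottedISketch {
  // Full (string-level) lowercase mapping in a locale-neutral setting.
  static int[] fullLowercase(String s) {
    return s.toLowerCase(Locale.ROOT).codePoints().toArray();
  }

  // Simple (one code point in, one code point out) lowercase mapping.
  static int simpleLowercase(int codePoint) {
    return Character.toLowerCase(codePoint);
  }

  public static void main(String[] args) {
    // U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE.
    for (int cp : fullLowercase("\u0130")) {
      System.out.printf("U+%04X%n", cp);
    }
    System.out.printf("U+%04X%n", simpleLowercase(0x0130));
  }
}
```

With `Locale.ROOT`, the full mapping gives U+0069 followed by U+0307 (combining dot above), whereas `Character.toLowerCase(0x0130)` gives just U+0069.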



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##########
@@ -841,117 +842,255 @@ public static UTF8String translate(final UTF8String input,
     return UTF8String.fromString(sb.toString());
   }
 
+  /**
+   * Trims the `srcString` string from both ends of the string using the specified `trimString`
+   * characters, with respect to the UTF8_LCASE collation. String trimming is performed by
+   * first trimming the left side of the string, and then trimming the right side of the string.
+   * The method returns the trimmed string. If the `trimString` is null, the method returns null.
+   *
+   * @param srcString the input string to be trimmed from both ends of the string
+   * @param trimString the trim string characters to trim
+   * @return the trimmed string (for UTF8_LCASE collation)
+   */
   public static UTF8String lowercaseTrim(
       final UTF8String srcString,
       final UTF8String trimString) {
-    // Matching UTF8String behavior for null `trimString`.
-    if (trimString == null) {
-      return null;
-    }
+    return lowercaseTrimRight(lowercaseTrimLeft(srcString, trimString), trimString);
+  }
 
-    UTF8String leftTrimmed = lowercaseTrimLeft(srcString, trimString);
-    return lowercaseTrimRight(leftTrimmed, trimString);
+  /**
+   * Trims the `srcString` string from both ends of the string using the specified `trimString`
+   * characters, with respect to all ICU collations in Spark. String trimming is performed by
+   * first trimming the left side of the string, and then trimming the right side of the string.
+   * The method returns the trimmed string. If the `trimString` is null, the method returns null.
+   *
+   * @param srcString the input string to be trimmed from both ends of the string
+   * @param trimString the trim string characters to trim
+   * @return the trimmed string (for ICU collations)
+   */
+  public static UTF8String trim(
+      final UTF8String srcString,
+      final UTF8String trimString,
+      final int collationId) {
+    return trimRight(trimLeft(srcString, trimString, collationId), trimString, collationId);
   }
 
+  /**
+   * Trims the `srcString` string from the left side using the specified `trimString` characters,
+   * with respect to the UTF8_LCASE collation. For UTF8_LCASE, the method first creates a hash
+   * set of lowercased code points in `trimString`, and then iterates over the `srcString` from
+   * the left side, until reaching a character whose lowercased code point is not in the hash set.
+   * Finally, the method returns the substring from that position to the end of `srcString`.
+   * If `trimString` is null, null is returned. If `trimString` is empty, `srcString` is returned.
+   *
+   * @param srcString the input string to be trimmed from the left end of the string
+   * @param trimString the trim string characters to trim
+   * @return the trimmed string (for UTF8_LCASE collation)
+   */
   public static UTF8String lowercaseTrimLeft(
       final UTF8String srcString,
       final UTF8String trimString) {
-    // Matching UTF8String behavior for null `trimString`.
+    // Matching the default UTF8String behavior for null `trimString`.
     if (trimString == null) {
       return null;
     }
 
-    // The searching byte position in the srcString.
-    int searchIdx = 0;
-    // The byte position of a first non-matching character in the srcString.
-    int trimByteIdx = 0;
-    // Number of bytes in srcString.
-    int numBytes = srcString.numBytes();
-    // Convert trimString to lowercase, so it can be searched properly.
-    UTF8String lowercaseTrimString = trimString.toLowerCase();
-
-    while (searchIdx < numBytes) {
-      UTF8String searchChar = srcString.copyUTF8String(
-        searchIdx,
-        searchIdx + UTF8String.numBytesForFirstByte(srcString.getByte(searchIdx)) - 1);
-      int searchCharBytes = searchChar.numBytes();
-
-      // Try to find the matching for the searchChar in the trimString.
-      if (lowercaseTrimString.find(searchChar.toLowerCase(), 0) >= 0) {
-        trimByteIdx += searchCharBytes;
-        searchIdx += searchCharBytes;
-      } else {
-        // No matching, exit the search.
-        break;
+    // Create a hash set of lowercased code points for all characters of `trimString`.
+    HashSet<Integer> trimChars = new HashSet<>();
+    Iterator<Integer> trimIter = trimString.codePointIterator();
+    while (trimIter.hasNext()) trimChars.add(getLowercaseCodePoint(trimIter.next()));
+
+    // Iterate over `srcString` from the left to find the first character that is not in the set.
+    int searchIndex = 0, codePoint;
+    Iterator<Integer> srcIter = srcString.codePointIterator();
+    while (srcIter.hasNext()) {
+      codePoint = getLowercaseCodePoint(srcIter.next());
+      // Special handling for Turkish dotted uppercase letter I.
+      if (codePoint == CODE_POINT_LOWERCASE_I && srcIter.hasNext() &&
+          trimChars.contains(CODE_POINT_COMBINED_LOWERCASE_I_DOT)) {
+        int nextCodePoint = getLowercaseCodePoint(srcIter.next());
+        if ((trimChars.contains(codePoint) && trimChars.contains(nextCodePoint))
+          || nextCodePoint == CODE_POINT_COMBINING_DOT) searchIndex += 2;
+        else {
+          if (trimChars.contains(codePoint)) ++searchIndex;
+          break;
+        }
+      } else if (trimChars.contains(codePoint)) {
+        ++searchIndex;
       }
+      else break;

Review Comment:
   fixed
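
For reference, the general shape of the loop in the hunk above (build a set of lowercased code points from `trimString`, then skip leading code points of `srcString` whose lowercased form is in the set) can be sketched on plain `java.lang.String` as follows. This is a simplified illustration under stated assumptions: it uses `Character.toLowerCase` in place of the diff's `getLowercaseCodePoint`, it deliberately omits the dotted-I special case, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified sketch of a case-insensitive left trim; not Spark code.
class LowercaseTrimLeftSketch {
  static String trimLeft(String src, String trim) {
    // Mirror the null-in, null-out behavior described in the Javadoc of the diff.
    if (trim == null) return null;

    // Collect the lowercased code points of every character in `trim`.
    Set<Integer> trimChars = new HashSet<>();
    trim.codePoints().forEach(cp -> trimChars.add(Character.toLowerCase(cp)));

    // Advance past leading code points whose lowercased form is in the set.
    int index = 0;
    while (index < src.length()) {
      int cp = src.codePointAt(index);
      if (!trimChars.contains(Character.toLowerCase(cp))) break;
      index += Character.charCount(cp); // step by 1 or 2 UTF-16 units
    }

    // Return the original string (original casing preserved) when nothing was trimmed.
    return index == 0 ? src : src.substring(index);
  }
}
```

Under these assumptions, `trimLeft("XxAbcX", "x")` returns `AbcX`: both the uppercase and lowercase leading `x` are removed, the rest keeps its original casing, and the trailing `X` is untouched because only the left side is trimmed.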



##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##########
@@ -841,117 +842,255 @@ public static UTF8String translate(final UTF8String input,
     return UTF8String.fromString(sb.toString());
   }
 
+  /**
+   * Trims the `srcString` string from both ends of the string using the specified `trimString`
+   * characters, with respect to the UTF8_LCASE collation. String trimming is performed by
+   * first trimming the left side of the string, and then trimming the right side of the string.
+   * The method returns the trimmed string. If the `trimString` is null, the method returns null.
+   *
+   * @param srcString the input string to be trimmed from both ends of the string
+   * @param trimString the trim string characters to trim
+   * @return the trimmed string (for UTF8_LCASE collation)
+   */
   public static UTF8String lowercaseTrim(
       final UTF8String srcString,
       final UTF8String trimString) {
-    // Matching UTF8String behavior for null `trimString`.
-    if (trimString == null) {
-      return null;
-    }
+    return lowercaseTrimRight(lowercaseTrimLeft(srcString, trimString), trimString);
+  }
 
-    UTF8String leftTrimmed = lowercaseTrimLeft(srcString, trimString);
-    return lowercaseTrimRight(leftTrimmed, trimString);
+  /**
+   * Trims the `srcString` string from both ends of the string using the specified `trimString`
+   * characters, with respect to all ICU collations in Spark. String trimming is performed by
+   * first trimming the left side of the string, and then trimming the right side of the string.
+   * The method returns the trimmed string. If the `trimString` is null, the method returns null.
+   *
+   * @param srcString the input string to be trimmed from both ends of the string
+   * @param trimString the trim string characters to trim
+   * @return the trimmed string (for ICU collations)
+   */
+  public static UTF8String trim(
+      final UTF8String srcString,
+      final UTF8String trimString,
+      final int collationId) {
+    return trimRight(trimLeft(srcString, trimString, collationId), trimString, collationId);
   }
 
+  /**
+   * Trims the `srcString` string from the left side using the specified `trimString` characters,
+   * with respect to the UTF8_LCASE collation. For UTF8_LCASE, the method first creates a hash
+   * set of lowercased code points in `trimString`, and then iterates over the `srcString` from
+   * the left side, until reaching a character whose lowercased code point is not in the hash set.
+   * Finally, the method returns the substring from that position to the end of `srcString`.
+   * If `trimString` is null, null is returned. If `trimString` is empty, `srcString` is returned.
+   *
+   * @param srcString the input string to be trimmed from the left end of the string
+   * @param trimString the trim string characters to trim
+   * @return the trimmed string (for UTF8_LCASE collation)
+   */
   public static UTF8String lowercaseTrimLeft(
       final UTF8String srcString,
       final UTF8String trimString) {
-    // Matching UTF8String behavior for null `trimString`.
+    // Matching the default UTF8String behavior for null `trimString`.
     if (trimString == null) {
       return null;
     }
 
-    // The searching byte position in the srcString.
-    int searchIdx = 0;
-    // The byte position of a first non-matching character in the srcString.
-    int trimByteIdx = 0;
-    // Number of bytes in srcString.
-    int numBytes = srcString.numBytes();
-    // Convert trimString to lowercase, so it can be searched properly.
-    UTF8String lowercaseTrimString = trimString.toLowerCase();
-
-    while (searchIdx < numBytes) {
-      UTF8String searchChar = srcString.copyUTF8String(
-        searchIdx,
-        searchIdx + UTF8String.numBytesForFirstByte(srcString.getByte(searchIdx)) - 1);
-      int searchCharBytes = searchChar.numBytes();
-
-      // Try to find the matching for the searchChar in the trimString.
-      if (lowercaseTrimString.find(searchChar.toLowerCase(), 0) >= 0) {
-        trimByteIdx += searchCharBytes;
-        searchIdx += searchCharBytes;
-      } else {
-        // No matching, exit the search.
-        break;
+    // Create a hash set of lowercased code points for all characters of `trimString`.
+    HashSet<Integer> trimChars = new HashSet<>();
+    Iterator<Integer> trimIter = trimString.codePointIterator();
+    while (trimIter.hasNext()) trimChars.add(getLowercaseCodePoint(trimIter.next()));
+
+    // Iterate over `srcString` from the left to find the first character that is not in the set.
+    int searchIndex = 0, codePoint;
+    Iterator<Integer> srcIter = srcString.codePointIterator();
+    while (srcIter.hasNext()) {
+      codePoint = getLowercaseCodePoint(srcIter.next());
+      // Special handling for Turkish dotted uppercase letter I.
+      if (codePoint == CODE_POINT_LOWERCASE_I && srcIter.hasNext() &&
+          trimChars.contains(CODE_POINT_COMBINED_LOWERCASE_I_DOT)) {
+        int nextCodePoint = getLowercaseCodePoint(srcIter.next());
+        if ((trimChars.contains(codePoint) && trimChars.contains(nextCodePoint))
+          || nextCodePoint == CODE_POINT_COMBINING_DOT) searchIndex += 2;
+        else {
+          if (trimChars.contains(codePoint)) ++searchIndex;
+          break;
+        }
+      } else if (trimChars.contains(codePoint)) {
+        ++searchIndex;
       }
+      else break;
     }
 
-    if (searchIdx == 0) {
-      // Nothing trimmed - return original string (not converted to lowercase).
-      return srcString;
+    // Return the substring from that position to the end of the string.
+    return searchIndex == 0 ? srcString : srcString.substring(searchIndex, srcString.numChars());
+  }
+
+  /**
+   * Trims the `srcString` string from the left side using the specified `trimString` characters,
+   * with respect to ICU collations. For these collations, the method iterates over `srcString`
+   * from left to right, and repeatedly skips the longest possible substring that matches any
+   * character in `trimString`, until reaching a character that is not found in `trimString`.
+   * Finally, the method returns the substring from that position to the end of `srcString`.
+   * If `trimString` is null, null is returned. If `trimString` is empty, `srcString` is returned.
+   *
+   * @param srcString the input string to be trimmed from the left end of the string
+   * @param trimString the trim string characters to trim

Review Comment:
   added


