stevomitric commented on code in PR #48642:
URL: https://github.com/apache/spark/pull/48642#discussion_r1816837756
##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala:
##########
@@ -1001,16 +1001,51 @@ class CollationSQLExpressionsSuite
       collation: String,
       result: R)
     val testCases = Seq(
-      StringToMapTestCase("a:1,b:2,c:3", ",", ":", "UTF8_BINARY",
+      StringToMapTestCase(
+        "a:1,b:2,c:3",
+        ",",
+        ":",
+        "UTF8_BINARY",
Review Comment:
Actually, this is the output of `dev/scalafmt`, so I think this is the more
correct formatting (the scalastyle check was failing).
##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##########
@@ -1434,6 +1435,42 @@ public static UTF8String[] icuSplitSQL(final UTF8String string, final UTF8String
     return strings.toArray(new UTF8String[0]);
   }
+  /**
+   * Splits the `string` into an array of substrings based on the `delimiter` regex, with respect
+   * to the maximum number of substrings `limit`.
+   *
+   * @param string the string to be split
+   * @param delimiter the delimiter regex to split the string
+   * @param limit the maximum number of substrings to return
+   * @return an array of substrings
+   */
+  public static UTF8String[] split(final UTF8String string, final UTF8String delimiter,
+      final int limit, final int collationId) {
+    CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+    assert collation.isUtf8BinaryType || collation.isUtf8LcaseType :
+      "Unsupported collation type for split operation.";
+
+    if (CollationFactory.fetchCollation(collationId).isUtf8BinaryType) {
+      return string.split(delimiter, limit);
+    } else {
+      return lowercaseSplit(string, delimiter, limit);
+    }
Review Comment:
This was placed here to follow the existing `splitSQL` function, which is
already in this file - please see: https://github.com/apache/spark/pull/47621.
In my opinion, since they have very similar APIs, we shouldn't separate them.
If you are really motivated to move this, I think it should be part of a
separate refactoring PR that takes care of both of them.
##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##########
@@ -1434,6 +1435,42 @@ public static UTF8String[] icuSplitSQL(final UTF8String string, final UTF8String
     return strings.toArray(new UTF8String[0]);
   }
+  /**
+   * Splits the `string` into an array of substrings based on the `delimiter` regex, with respect
+   * to the maximum number of substrings `limit`.
+   *
+   * @param string the string to be split
+   * @param delimiter the delimiter regex to split the string
+   * @param limit the maximum number of substrings to return
+   * @return an array of substrings
+   */
+  public static UTF8String[] split(final UTF8String string, final UTF8String delimiter,
+      final int limit, final int collationId) {
+    CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+    assert collation.isUtf8BinaryType || collation.isUtf8LcaseType :
+      "Unsupported collation type for split operation.";
Review Comment:
Ideally we would fail in analysis, but this assert helps with readability: the
`if` statement below doesn't cover all 3 cases - binary, lcase and ICU (which is
what we usually have) - so it might mislead the reader into thinking we specialize
for binary and handle everything else, while in reality we don't cover the ICU case.
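To make that concrete, here is a rough sketch of what a full three-way dispatch
could look like. This is not something the PR adds - the method name and the
`icuSplit` helper are hypothetical, purely to mark the branch that the assert
rules out:

  // Hypothetical sketch only (not part of this PR): a split that also dispatched on
  // ICU collations. `icuSplit` does not exist; it just marks the branch the current
  // code deliberately leaves out.
  public static UTF8String[] splitAllCollations(final UTF8String string, final UTF8String delimiter,
      final int limit, final int collationId) {
    CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
    if (collation.isUtf8BinaryType) {
      return string.split(delimiter, limit);
    } else if (collation.isUtf8LcaseType) {
      return lowercaseSplit(string, delimiter, limit);
    } else {
      return icuSplit(string, delimiter, limit, collationId);  // hypothetical ICU-aware split
    }
  }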
##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##########
@@ -1434,6 +1435,42 @@ public static UTF8String[] icuSplitSQL(final UTF8String string, final UTF8String
     return strings.toArray(new UTF8String[0]);
   }
+  /**
+   * Splits the `string` into an array of substrings based on the `delimiter` regex, with respect
+   * to the maximum number of substrings `limit`.
+   *
+   * @param string the string to be split
+   * @param delimiter the delimiter regex to split the string
+   * @param limit the maximum number of substrings to return
+   * @return an array of substrings
+   */
+  public static UTF8String[] split(final UTF8String string, final UTF8String delimiter,
+      final int limit, final int collationId) {
+    CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+    assert collation.isUtf8BinaryType || collation.isUtf8LcaseType :
+      "Unsupported collation type for split operation.";
+
+    if (CollationFactory.fetchCollation(collationId).isUtf8BinaryType) {
+      return string.split(delimiter, limit);
+    } else {
+      return lowercaseSplit(string, delimiter, limit);
+    }
+  }
+
+  public static UTF8String[] lowercaseSplit(final UTF8String string, final UTF8String delimiter,
+      final int limit) {
+    if (delimiter.numBytes() == 0) return new UTF8String[] { string };
+    if (string.numBytes() == 0) return new UTF8String[] { UTF8String.EMPTY_UTF8 };
Review Comment:
In what sense? In this function we use a case-insensitive split, while
`UTF8String.split` does not - that should already be communicated through the
function name. Are you referring to invalid char handling and/or empty pattern
matching (as shown in the code snippet you gave)?
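For reference, the difference I have in mind is roughly the following. The demo
class name and the expected outputs are my assumptions about how the new helper
should behave, not verified results:

import org.apache.spark.sql.catalyst.util.CollationAwareUTF8String;
import org.apache.spark.unsafe.types.UTF8String;

import java.util.Arrays;

public class SplitCaseSensitivityDemo {
  public static void main(String[] args) {
    UTF8String input = UTF8String.fromString("1a2A3");
    UTF8String delim = UTF8String.fromString("a");

    // Binary (case-sensitive) split: only the lowercase 'a' matches the delimiter regex,
    // so the expected result is ["1", "2A3"].
    UTF8String[] binary = input.split(delim, -1);

    // Case-insensitive split via the new helper from this PR: 'a' also matches 'A',
    // so the expected result is ["1", "2", "3"]. A negative limit is assumed to mean
    // "no limit", mirroring the binary path.
    UTF8String[] lcase = CollationAwareUTF8String.lowercaseSplit(input, delim, -1);

    System.out.println(Arrays.toString(binary));
    System.out.println(Arrays.toString(lcase));
  }
}

The exact edge-case semantics (e.g. trailing empty strings, invalid chars) may of
course differ from this sketch - that is exactly what I'm asking about above.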
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]