stevomitric commented on code in PR #48642:
URL: https://github.com/apache/spark/pull/48642#discussion_r1816837756
##########
sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala:
##########
@@ -1001,16 +1001,51 @@ class CollationSQLExpressionsSuite
       collation: String,
       result: R)
     val testCases = Seq(
-      StringToMapTestCase("a:1,b:2,c:3", ",", ":", "UTF8_BINARY",
+      StringToMapTestCase(
+        "a:1,b:2,c:3",
+        ",",
+        ":",
+        "UTF8_BINARY",
Review Comment:
Actually, this is the output of `dev/scalafmt`, so I think this is the more
correct formatting (the scalastyle check was failing).
##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##########
@@ -1434,6 +1435,42 @@ public static UTF8String[] icuSplitSQL(final UTF8String string, final UTF8String
     return strings.toArray(new UTF8String[0]);
   }
+  /**
+   * Splits the `string` into an array of substrings based on the `delimiter` regex, with respect
+   * to the maximum number of substrings `limit`.
+   *
+   * @param string the string to be split
+   * @param delimiter the delimiter regex to split the string
+   * @param limit the maximum number of substrings to return
+   * @return an array of substrings
+   */
+  public static UTF8String[] split(final UTF8String string, final UTF8String delimiter,
+      final int limit, final int collationId) {
+    CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+    assert collation.isUtf8BinaryType || collation.isUtf8LcaseType :
+      "Unsupported collation type for split operation.";
+
+    if (CollationFactory.fetchCollation(collationId).isUtf8BinaryType) {
+      return string.split(delimiter, limit);
+    } else {
+      return lowercaseSplit(string, delimiter, limit);
+    }
Review Comment:
This was placed here to follow the existing `splitSQL` function, which is
already in this file - please see: https://github.com/apache/spark/pull/47621.
In my opinion, since they have very similar APIs, we shouldn't separate them.
If you are really motivated to move this, I think it should be part of a
separate refactoring PR that takes care of both of them.
##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##########
@@ -1434,6 +1435,42 @@ public static UTF8String[] icuSplitSQL(final UTF8String string, final UTF8String
     return strings.toArray(new UTF8String[0]);
   }
+  /**
+   * Splits the `string` into an array of substrings based on the `delimiter` regex, with respect
+   * to the maximum number of substrings `limit`.
+   *
+   * @param string the string to be split
+   * @param delimiter the delimiter regex to split the string
+   * @param limit the maximum number of substrings to return
+   * @return an array of substrings
+   */
+  public static UTF8String[] split(final UTF8String string, final UTF8String delimiter,
+      final int limit, final int collationId) {
+    CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+    assert collation.isUtf8BinaryType || collation.isUtf8LcaseType :
+      "Unsupported collation type for split operation.";
Review Comment:
Ideally we would fail in analysis, but this assert helps with readability: the
`if` statement below doesn't cover all 3 cases - binary, lcase and ICU (which is
what we usually have) - so it might mislead the reader into thinking we specialize
for binary and handle everything else, while in reality we don't cover the ICU case.
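To make that concrete, here is a rough sketch of what a full three-way dispatch
could look like. This is not something the PR adds - the method name and the
`icuSplit` helper are hypothetical, purely to mark the branch that the assert
rules out:

  // Hypothetical sketch only (not part of this PR): a split that also dispatched on
  // ICU collations. `icuSplit` does not exist; it just marks the branch the current
  // code deliberately leaves out.
  public static UTF8String[] splitAllCollations(final UTF8String string, final UTF8String delimiter,
      final int limit, final int collationId) {
    CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
    if (collation.isUtf8BinaryType) {
      return string.split(delimiter, limit);
    } else if (collation.isUtf8LcaseType) {
      return lowercaseSplit(string, delimiter, limit);
    } else {
      return icuSplit(string, delimiter, limit, collationId);  // hypothetical ICU-aware split
    }
  }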
##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##########
@@ -1434,6 +1435,42 @@ public static UTF8String[] icuSplitSQL(final UTF8String string, final UTF8String
     return strings.toArray(new UTF8String[0]);
   }
+  /**
+   * Splits the `string` into an array of substrings based on the `delimiter` regex, with respect
+   * to the maximum number of substrings `limit`.
+   *
+   * @param string the string to be split
+   * @param delimiter the delimiter regex to split the string
+   * @param limit the maximum number of substrings to return
+   * @return an array of substrings
+   */
+  public static UTF8String[] split(final UTF8String string, final UTF8String delimiter,
+      final int limit, final int collationId) {
+    CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+    assert collation.isUtf8BinaryType || collation.isUtf8LcaseType :
+      "Unsupported collation type for split operation.";
+
+    if (CollationFactory.fetchCollation(collationId).isUtf8BinaryType) {
+      return string.split(delimiter, limit);
+    } else {
+      return lowercaseSplit(string, delimiter, limit);
+    }
+  }
+
+  public static UTF8String[] lowercaseSplit(final UTF8String string, final UTF8String delimiter,
+      final int limit) {
+    if (delimiter.numBytes() == 0) return new UTF8String[] { string };
+    if (string.numBytes() == 0) return new UTF8String[] { UTF8String.EMPTY_UTF8 };
Review Comment:
In what sense? In this function we use a case-insensitive split, while
`UTF8String.split` does not - that should already be communicated through the
function name. Are you referring to invalid char handling and/or empty pattern
matching (as shown in the code snippet you gave)?
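For reference, the difference I have in mind is roughly the following. The demo
class name and the expected outputs are my assumptions about how the new helper
should behave, not verified results:

import org.apache.spark.sql.catalyst.util.CollationAwareUTF8String;
import org.apache.spark.unsafe.types.UTF8String;

import java.util.Arrays;

public class SplitCaseSensitivityDemo {
  public static void main(String[] args) {
    UTF8String input = UTF8String.fromString("1a2A3");
    UTF8String delim = UTF8String.fromString("a");

    // Binary (case-sensitive) split: only the lowercase 'a' matches the delimiter regex,
    // so the expected result is ["1", "2A3"].
    UTF8String[] binary = input.split(delim, -1);

    // Case-insensitive split via the new helper from this PR: 'a' also matches 'A',
    // so the expected result is ["1", "2", "3"]. A negative limit is assumed to mean
    // "no limit", mirroring the binary path.
    UTF8String[] lcase = CollationAwareUTF8String.lowercaseSplit(input, delim, -1);

    System.out.println(Arrays.toString(binary));
    System.out.println(Arrays.toString(lcase));
  }
}

The exact edge-case semantics (e.g. trailing empty strings, invalid chars) may of
course differ from this sketch - that is exactly what I'm asking about above.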
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]