uros-db commented on code in PR #48642:
URL: https://github.com/apache/spark/pull/48642#discussion_r1816728135
##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##########
@@ -1434,6 +1435,42 @@ public static UTF8String[] icuSplitSQL(final UTF8String string, final UTF8String
return strings.toArray(new UTF8String[0]);
}
+ /**
+ * Splits the `string` into an array of substrings based on the `delimiter` regex, with respect
+ * to the maximum number of substrings `limit`.
+ *
+ * @param string the string to be split
+ * @param delimiter the delimiter regex to split the string
+ * @param limit the maximum number of substrings to return
+ * @return an array of substrings
+ */
+ public static UTF8String[] split(final UTF8String string, final UTF8String delimiter,
+ final int limit, final int collationId) {
+ CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+ assert collation.isUtf8BinaryType || collation.isUtf8LcaseType :
+ "Unsupported collation type for split operation.";
+
+ if (CollationFactory.fetchCollation(collationId).isUtf8BinaryType) {
+ return string.split(delimiter, limit);
+ } else {
+ return lowercaseSplit(string, delimiter, limit);
+ }
Review Comment:
Branching execution based on collation should not be done in
`CollationAwareUTF8String`.
Please see `CollationSupport.java` and follow its implementation pattern,
introducing a new class for `StringToMap` if necessary.
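
To illustrate the separation being asked for, here is a minimal, self-contained sketch. It is not Spark code: `String` stands in for `UTF8String`, the `CollationKind` enum stands in for the flags on `CollationFactory.Collation`, and the names `LowLevelHelpers` and `StringToMapSplit` are hypothetical. It assumes the pattern in `CollationSupport.java` is one small class per expression, with an `exec` entry point that picks a per-collation implementation, while the helper class only holds the algorithms. The case-insensitive regex inside `lowercaseSplit` is also an assumption made for the sake of a runnable example.

```java
import java.util.regex.Pattern;

public class CollationDispatchSketch {

    /** Hypothetical stand-in for the collation flags checked in the diff. */
    public enum CollationKind { UTF8_BINARY, UTF8_LCASE, OTHER }

    /** Stand-in for the low-level helper class: one algorithm, no collation checks. */
    public static final class LowLevelHelpers {
        public static String[] lowercaseSplit(String s, String delimiterRegex, int limit) {
            // Assumption: "lowercase" splitting is modelled as a case-insensitive regex split.
            Pattern p = Pattern.compile(delimiterRegex,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            return p.split(s, limit);
        }
    }

    /** Hypothetical per-expression class: the only place that branches on collation. */
    public static final class StringToMapSplit {
        public static String[] exec(String s, String delimiterRegex, int limit,
                                    CollationKind kind) {
            switch (kind) {
                case UTF8_BINARY:
                    return execBinary(s, delimiterRegex, limit);
                case UTF8_LCASE:
                    return execLowercase(s, delimiterRegex, limit);
                default:
                    throw new UnsupportedOperationException(
                        "Unsupported collation type for split operation: " + kind);
            }
        }

        public static String[] execBinary(String s, String delimiterRegex, int limit) {
            // Byte-for-byte semantics: plain regex split.
            return s.split(delimiterRegex, limit);
        }

        public static String[] execLowercase(String s, String delimiterRegex, int limit) {
            // Delegates to the helper, which does not know about collations.
            return LowLevelHelpers.lowercaseSplit(s, delimiterRegex, limit);
        }
    }

    public static void main(String[] args) {
        String input = "k1:v1Xk2:v2";
        // The lowercase variant treats "X" and "x" as the same delimiter.
        System.out.println(String.join("|",
            StringToMapSplit.exec(input, "x", -1, CollationKind.UTF8_LCASE)));
        // The binary variant finds no lowercase "x", so the input is returned whole.
        System.out.println(String.join("|",
            StringToMapSplit.exec(input, "x", -1, CollationKind.UTF8_BINARY)));
    }
}
```

With this layout, callers go through `exec`, and supporting another collation type means adding one more `exec...` method in the dispatch class rather than touching the helper.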
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]