uros-db commented on code in PR #48642:
URL: https://github.com/apache/spark/pull/48642#discussion_r1816734906
##########
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java:
##########
@@ -1434,6 +1435,42 @@ public static UTF8String[] icuSplitSQL(final UTF8String string, final UTF8String
     return strings.toArray(new UTF8String[0]);
   }
+  /**
+   * Splits the `string` into an array of substrings based on the `delimiter` regex, with respect
+   * to the maximum number of substrings `limit`.
+   *
+   * @param string the string to be split
+   * @param delimiter the delimiter regex to split the string
+   * @param limit the maximum number of substrings to return
+   * @return an array of substrings
+   */
+  public static UTF8String[] split(final UTF8String string, final UTF8String delimiter,
+      final int limit, final int collationId) {
+    CollationFactory.Collation collation = CollationFactory.fetchCollation(collationId);
+    assert collation.isUtf8BinaryType || collation.isUtf8LcaseType :
+      "Unsupported collation type for split operation.";
+
+    if (CollationFactory.fetchCollation(collationId).isUtf8BinaryType) {
+      return string.split(delimiter, limit);
+    } else {
+      return lowercaseSplit(string, delimiter, limit);
+    }
+  }
+
+  public static UTF8String[] lowercaseSplit(final UTF8String string, final UTF8String delimiter,
+      final int limit) {
+    if (delimiter.numBytes() == 0) return new UTF8String[] { string };
+    if (string.numBytes() == 0) return new UTF8String[] { UTF8String.EMPTY_UTF8 };
Review Comment:
Could you briefly explain whether this fully matches the behaviour of
`UTF8String.split`?
For reference:
```
public UTF8String[] split(UTF8String pattern, int limit) {
  // For the empty `pattern` a `split` function ignores trailing empty strings unless original
  // string is empty.
  if (numBytes() != 0 && pattern.numBytes() == 0) {
    int newLimit = limit > numChars() || limit <= 0 ? numChars() : limit;
    byte[] input = getBytes();
    int byteIndex = 0;
    int charIndex = 0;
    UTF8String[] result = new UTF8String[newLimit];
    while (charIndex < newLimit) {
      int currCharNumBytes = numBytesForFirstByte(input[byteIndex]);
      result[charIndex++] = UTF8String.fromBytes(input, byteIndex, currCharNumBytes);
      byteIndex += currCharNumBytes;
    }
    return result;
  }
  return split(pattern.toString(), limit);
}
```
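To make the question concrete, here is a minimal standalone sketch of the empty-delimiter case. It uses `java.lang.String` rather than `UTF8String` so that it runs without Spark on the classpath, and the names `EmptyDelimiterSplitSketch`, `splitOnEmptyPattern` and `earlyReturnOnEmptyDelimiter` are hypothetical, introduced only for illustration. The first helper mirrors the empty-pattern branch of `UTF8String.split` quoted above; the second mirrors the early return in `lowercaseSplit` from the diff. Assuming this reading of both snippets is correct, the two appear to disagree for a non-empty string with an empty delimiter:

```java
import java.util.Arrays;

class EmptyDelimiterSplitSketch {
  // Mirrors the empty-pattern branch of the quoted UTF8String.split, but on
  // java.lang.String (an assumption made to keep the sketch self-contained).
  // Assumes a non-empty input, as the quoted branch does: one element per
  // character, capped at `limit` when 0 < limit <= number of characters.
  static String[] splitOnEmptyPattern(String s, int limit) {
    int numChars = s.codePointCount(0, s.length());
    int newLimit = (limit > numChars || limit <= 0) ? numChars : limit;
    String[] result = new String[newLimit];
    int index = 0;
    for (int i = 0; i < newLimit; i++) {
      int next = s.offsetByCodePoints(index, 1);
      result[i] = s.substring(index, next);
      index = next;
    }
    return result;
  }

  // Mirrors the early return in the proposed lowercaseSplit:
  // an empty delimiter yields the whole input as a single element.
  static String[] earlyReturnOnEmptyDelimiter(String s) {
    return new String[] { s };
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(splitOnEmptyPattern("abc", -1)));      // prints [a, b, c]
    System.out.println(Arrays.toString(earlyReturnOnEmptyDelimiter("abc")));  // prints [abc]
  }
}
```

If that reading holds, splitting `abc` on an empty delimiter would give three elements under a UTF8_BINARY collation but a single element under UTF8_LCASE, which may or may not be intended.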
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]