amaliujia commented on a change in pull request #35352:
URL: https://github.com/apache/spark/pull/35352#discussion_r827245478
##########
File path:
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
##########
@@ -999,13 +1000,22 @@ public static UTF8String concatWs(UTF8String separator,
UTF8String... inputs) {
}
public UTF8String[] split(UTF8String pattern, int limit) {
+ return split(pattern, limit, false);
+ }
+
+ public UTF8String[] split(UTF8String pattern, int limit, boolean ifQuote) {
Review comment:
Sure. This is an important decision that will change the implementation, so we
should make a call before reviewing the remaining code details.
The current split treats `pattern` as a regex, while split_part treats `pattern`
as a fixed string (equivalently, a quoted regex pattern). Because split_part
can be rewritten as ElementAt(split()), I added a branch to the existing code
path so that we re-use as much code as possible while preserving the small
differences between the two function specs.
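To make the regex versus fixed-string difference concrete, here is a minimal sketch written against `java.lang.String` rather than `UTF8String`, so it runs without Spark. The class and method names are hypothetical, and the assumption that the `ifQuote` branch amounts to `Pattern.quote` is mine for illustration; it is not necessarily what the PR does.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical stand-in for the branch discussed above; not the PR's code.
final class SplitQuoteSketch {

    // Mirrors the shape of split(pattern, limit, ifQuote) using plain Strings.
    static String[] split(String input, String pattern, int limit, boolean ifQuote) {
        // Pattern.quote wraps the text in \Q...\E so regex metacharacters
        // such as '.' are matched literally.
        String regex = ifQuote ? Pattern.quote(pattern) : pattern;
        return input.split(regex, limit);
    }

    // split_part expressed as element_at(split(...)).
    // Assumes a 1-based partNum that is within range.
    static String splitPart(String input, String delimiter, int partNum) {
        return split(input, delimiter, -1, true)[partNum - 1];
    }

    public static void main(String[] args) {
        // Unquoted: "." is a regex that matches every character,
        // so every resulting field is empty.
        System.out.println(Arrays.toString(split("a.b.c", ".", -1, false)));
        // Quoted: "." is a literal dot, so the fields are a, b and c.
        System.out.println(Arrays.toString(split("a.b.c", ".", -1, true)));
        System.out.println(splitPart("a.b.c", ".", 2));
    }
}
```

The point of the sketch is only that one boolean is enough to switch between the two specs, which is why branching inside the existing path looked attractive.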
There are two principles:
1. Implement a function that is aligned with most of the vendors.
2. Re-use code as much as possible but keep internal consistency.
This leads to two options:
Option 1: implement split_part separately, without re-using element_at and
split. This makes the behavior compatible with other vendors but might not
minimize the amount of new code.
Option 2: change split_part to follow split (that is, treat `pattern` as a
regex). This gives very nice code re-use, but our behavior will be fairly
unique among vendors.
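For comparison, a rough sketch of what Option 1 could look like: a standalone split_part that scans for a literal delimiter and never touches the regex engine. It uses plain `String` instead of `UTF8String`, and the names are hypothetical. The edge-case behavior (empty string for an out-of-range part, rejecting non-positive indexes and empty delimiters) is an assumption made for this sketch, not a decided spec.

```java
// Hypothetical illustration of Option 1; not code from the PR.
final class SplitPartStandaloneSketch {

    // Returns the partNum-th (1-based) field of input, split on a literal delimiter.
    static String splitPart(String input, String delimiter, int partNum) {
        // Assumption: this sketch only handles positive indexes and non-empty delimiters.
        if (partNum <= 0) {
            throw new IllegalArgumentException("partNum must be positive in this sketch");
        }
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must be non-empty in this sketch");
        }
        int start = 0;
        // Skip over the first partNum - 1 fields.
        for (int i = 1; i < partNum; i++) {
            int next = input.indexOf(delimiter, start);
            if (next < 0) {
                return ""; // Assumption: an out-of-range part yields the empty string.
            }
            start = next + delimiter.length();
        }
        // The requested field ends at the next delimiter or at the end of input.
        int end = input.indexOf(delimiter, start);
        return end < 0 ? input.substring(start) : input.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(splitPart("11.12.13", ".", 2));
    }
}
```

This avoids materializing the full array that ElementAt(split()) would build, at the cost of a second code path to maintain, which is the trade-off between the two options.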
What do you think @cloud-fan @srielau?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]