raminqaf commented on code in PR #27577:
URL: https://github.com/apache/flink/pull/27577#discussion_r3334683043
##########
docs/data/sql_functions.yml:
##########
@@ -409,7 +409,43 @@ string:
`str <CHAR | VARCHAR>, regex <CHAR | VARCHAR>`
+ Returns an `STRING` representation of the first matched substring.
`NULL` if any of the arguments are `NULL` or regex if invalid or pattern is not
found.
+ - sql: REGEXP_SPLIT(str, regex)
+ table: str.regexpSplit(regex)
+ description: |
+ Splits str by the regular expression regex and returns an array of
strings.
+
+ E.g., REGEXP_SPLIT('Hello123World456', '[0-9]+') returns ['Hello',
'World', ''].
+
+ `str <CHAR | VARCHAR>, regex <CHAR | VARCHAR>`
+
+ Returns an `ARRAY<STRING>` of split substrings. `NULL` if any of the
arguments are `NULL` or regex is invalid.
+>>>>>>> 8d75684590d (hotfix)
+=======
Returns an `STRING` representation of the first matched substring.
`NULL` if any of the arguments are `NULL` or regex is invalid or pattern is not
found.
+ - sql: REGEXP_SPLIT(str, regex)
+ table: str.regexpSplit(regex)
+ description: |
+ Splits str by the regular expression regex and returns an array of
strings.
+
+ E.g., REGEXP_SPLIT('Hello123World456', '[0-9]+') returns ['Hello',
'World', ''].
+
+ `str <CHAR | VARCHAR>, regex <CHAR | VARCHAR>`
+
+ Returns an `ARRAY<STRING>` of split substrings. `NULL` if any of the
arguments are `NULL` or regex is invalid.
+=======
Review Comment:
Can you please fix the docs? It seems the merge conflicts were not resolved
correctly: conflict markers and a duplicated `REGEXP_SPLIT` entry are still in
the file.
##########
flink-table/flink-table-common/src/main/java/org/apache/flink/table/functions/BuiltInFunctionDefinitions.java:
##########
@@ -445,6 +445,20 @@ ANY, and(logical(LogicalTypeRoot.BOOLEAN), LITERAL)
.runtimeClass("org.apache.flink.table.runtime.functions.scalar.SplitFunction")
.build();
+ public static final BuiltInFunctionDefinition REGEXP_SPLIT =
+ BuiltInFunctionDefinition.newBuilder()
+ .name("REGEXP_SPLIT")
+ .sqlName("REGEXP_SPLIT")
+ .kind(SCALAR)
+ .inputTypeStrategy(
+ sequence(
+
logical(LogicalTypeFamily.CHARACTER_STRING),
+
logical(LogicalTypeFamily.CHARACTER_STRING)))
Review Comment:
Please refer to `REGEXP_EXTRACT`
(https://github.com/apache/flink/pull/28140) and `REGEXP_REPLACE`
(https://github.com/apache/flink/pull/28189), which use the new validation logic
at planning time. That way we can catch an invalid regex pattern beforehand.
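To illustrate the idea only: the sketch below checks a literal pattern eagerly with plain `java.util.regex`, so a syntax error surfaces before any row is processed. The class `RegexLiteralCheck` and its `validate` method are hypothetical names for this example; they are not the validation API used in the PRs linked above, which should be followed for the real wiring.

```java
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Hypothetical helper, not part of Flink: eagerly validates a regex literal. */
public final class RegexLiteralCheck {

    private RegexLiteralCheck() {}

    /**
     * Returns an empty Optional if the pattern compiles, otherwise the
     * description of the syntax error reported by java.util.regex.
     */
    public static Optional<String> validate(String regex) {
        try {
            Pattern.compile(regex);
            return Optional.empty();
        } catch (PatternSyntaxException e) {
            return Optional.of(e.getDescription());
        }
    }
}
```

A planner-side check could turn a non-empty result into a validation error for literal arguments, while non-literal arguments would still need the runtime fallback.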
##########
flink-python/pyflink/table/expression.py:
##########
@@ -1362,6 +1362,18 @@ def regexp_substr(self, regex) -> 'Expression':
"""
return _binary_op("regexpSubstr")(self, regex)
+ def regexp_split(self, regex) -> 'Expression':
+ """
+ Splits the string by the regular expression regex and returns an array
of strings.
+ null if any of the arguments are null or regex is invalid.
+
+ E.g., regexp_split('Hello123World456', '[0-9]+') returns ['Hello',
'World', ''].
+
+ :param regex: A STRING expression with a matching pattern.
+ :return: An ARRAY<STRING> of split substrings.
+ """
Review Comment:
Ideally this should mirror the Javadoc in `BaseExpressions`.
##########
flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/functions/SqlFunctionUtils.java:
##########
@@ -491,6 +491,24 @@ public static Matcher getRegexpMatcher(@Nullable
StringData str, @Nullable Strin
}
}
+ /**
+ * Returns a compiled Pattern object for the given regular expression
string, using a shared
+ * cache for performance optimization.
+ *
+ * @param regex the regular expression pattern string
+ * @return the compiled Pattern, or null if regex is null or invalid
+ */
+ public static @Nullable Pattern getRegexpPattern(@Nullable String regex) {
+ if (regex == null) {
+ return null;
+ }
Review Comment:
We can also make the `regex` parameter non-null:
```suggestion
public static @Nullable Pattern getRegexpPattern(String regex) {
```
##########
flink-table/flink-table-planner/src/test/java/org/apache/flink/table/planner/functions/RegexpFunctionsITCase.java:
##########
@@ -387,4 +388,94 @@ private Stream<TestSetSpec> regexpSubstrTestCases() {
"Invalid input arguments. Expected signatures
are:\n"
+ "REGEXP_SUBSTR(str
<CHARACTER_STRING>, regex <CHARACTER_STRING>)"));
}
+
+ private Stream<TestSetSpec> regexpSplitTestCases() {
+ return Stream.of(
+
TestSetSpec.forFunction(BuiltInFunctionDefinitions.REGEXP_SPLIT)
+ .onFieldsWithData(
+ "Hello123World456",
+ null,
+ "a,b;c|d",
+ "one two three",
+ 123,
+ "12345",
+ ",123,,,123,")
+ .andDataTypes(
+ DataTypes.STRING().notNull(),
+ DataTypes.STRING(),
+ DataTypes.STRING().notNull(),
+ DataTypes.STRING().notNull(),
+ DataTypes.INT().notNull(),
+ DataTypes.STRING().notNull(),
+ DataTypes.STRING())
+ // Basic regex split
+ .testResult(
+ $("f0").regexpSplit("[0-9]+"),
+ "REGEXP_SPLIT(f0, '[0-9]+')",
+ new String[] {"Hello", "World", ""},
+ DataTypes.ARRAY(DataTypes.STRING()).notNull())
+ // null input test
+ .testResult(
+ $("f0").regexpSplit(null),
+ "REGEXP_SPLIT(f0, NULL)",
+ null,
+ DataTypes.ARRAY(DataTypes.STRING()))
+ // Empty regex - split by character
+ .testResult(
+ $("f5").regexpSplit(""),
+ "REGEXP_SPLIT(f5, '')",
+ new String[] {"1", "2", "3", "4", "5"},
+ DataTypes.ARRAY(DataTypes.STRING()).notNull())
+ // null string input
+ .testResult(
+ $("f1").regexpSplit("[0-9]+"),
+ "REGEXP_SPLIT(f1, '[0-9]+')",
+ null,
+ DataTypes.ARRAY(DataTypes.STRING()))
+ // null string and null pattern
+ .testResult(
+ $("f1").regexpSplit(null),
+ "REGEXP_SPLIT(f1, null)",
+ null,
+ DataTypes.ARRAY(DataTypes.STRING()))
+ // Multi-character delimiter regex
+ .testResult(
+ $("f2").regexpSplit("[,;|]"),
+ "REGEXP_SPLIT(f2, '[,;|]')",
+ new String[] {"a", "b", "c", "d"},
+ DataTypes.ARRAY(DataTypes.STRING()).notNull())
+ // Whitespace regex
+ .testResult(
+ $("f3").regexpSplit("\\s+"),
+ "REGEXP_SPLIT(f3, '\\s+')",
+ new String[] {"one", "two", "three"},
+ DataTypes.ARRAY(DataTypes.STRING()).notNull())
+ // No match - return original string
+ .testResult(
+ $("f5").regexpSplit("[a-z]+"),
+ "REGEXP_SPLIT(f5, '[a-z]+')",
+ new String[] {"12345"},
+ DataTypes.ARRAY(DataTypes.STRING()).notNull())
+ // Invalid regex - return null
Review Comment:
Please add tests for invalid regex input, both literal and non-literal.
##########
flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/functions/scalar/RegexpSplitFunction.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.table.runtime.functions.scalar;
+
+import org.apache.flink.annotation.Internal;
+import org.apache.flink.table.data.ArrayData;
+import org.apache.flink.table.data.GenericArrayData;
+import org.apache.flink.table.data.StringData;
+import org.apache.flink.table.functions.BuiltInFunctionDefinitions;
+import org.apache.flink.table.functions.SpecializedFunction;
+
+import javax.annotation.Nullable;
+
+import java.util.regex.Pattern;
+
+import static
org.apache.flink.table.runtime.functions.SqlFunctionUtils.getRegexpPattern;
+
+/**
+ * Implementation of {@link BuiltInFunctionDefinitions#REGEXP_SPLIT}.
+ *
+ * <p>Splits a string by a regular expression pattern and returns an array of
substrings.
+ *
+ * <p>Examples:
+ *
+ * <pre>{@code
+ * REGEXP_SPLIT('Hello123World456', '[0-9]+') = ['Hello', 'World', '']
+ * REGEXP_SPLIT('a,b;c', '[,;]') = ['a', 'b', 'c']
+ * REGEXP_SPLIT('one two three', '\\s+') = ['one', 'two', 'three']
+ * }</pre>
+ */
+@Internal
+public class RegexpSplitFunction extends BuiltInScalarFunction {
+
+ public RegexpSplitFunction(SpecializedFunction.SpecializedContext context)
{
+ super(BuiltInFunctionDefinitions.REGEXP_SPLIT, context);
+ }
+
+ public @Nullable ArrayData eval(@Nullable StringData str, @Nullable
StringData regex) {
+ if (str == null || regex == null) {
+ return null;
+ }
+
+ String regexStr = regex.toString();
+ if (regexStr.isEmpty()) {
+ // If regex is empty, split by each character
+ String strValue = str.toString();
+ StringData[] result = new StringData[strValue.length()];
+ for (int i = 0; i < strValue.length(); i++) {
+ result[i] =
StringData.fromString(String.valueOf(strValue.charAt(i)));
Review Comment:
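Please have a look at this PR: https://github.com/apache/flink/pull/28264
so that SMP characters are split correctly. They occupy two `char`s in a Java
string, and iterating with `charAt` breaks them into separate surrogates. Maybe
we can even extract this logic into a util.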
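A minimal sketch of what such a util could look like, assuming plain JDK APIs only. The class name `CodePointSplit` and the method `splitByCodePoint` are made up for this example and are not taken from the linked PR; a real version would produce `StringData` elements rather than `String`s.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical util, not part of Flink: splits a string per Unicode code point. */
public final class CodePointSplit {

    private CodePointSplit() {}

    /** Returns one string per code point, keeping surrogate pairs together. */
    public static List<String> splitByCodePoint(String s) {
        List<String> parts = new ArrayList<>(s.codePointCount(0, s.length()));
        for (int i = 0; i < s.length(); ) {
            // codePointAt combines a high/low surrogate pair into one code point.
            int codePoint = s.codePointAt(i);
            parts.add(new String(Character.toChars(codePoint)));
            // Advance by 1 for BMP characters, by 2 for supplementary ones.
            i += Character.charCount(codePoint);
        }
        return parts;
    }
}
```

With this approach the input `"a\uD83D\uDE00b"` yields three elements, whereas the `charAt` loop yields four, two of which are lone surrogates.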