Myracle commented on code in PR #27577:
URL: https://github.com/apache/flink/pull/27577#discussion_r2801879546


##########
flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/functions/scalar/RegexpSplitFunction.java:
##########
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.table.runtime.functions.scalar;
+
+import org.apache.flink.annotation.Internal;
+import org.apache.flink.table.data.ArrayData;
+import org.apache.flink.table.data.GenericArrayData;
+import org.apache.flink.table.data.StringData;
+import org.apache.flink.table.functions.BuiltInFunctionDefinitions;
+import org.apache.flink.table.functions.SpecializedFunction;
+
+import javax.annotation.Nullable;
+
+import java.util.regex.Pattern;
+
+import static org.apache.flink.table.runtime.functions.SqlFunctionUtils.getRegexpPattern;
+
+/**
+ * Implementation of {@link BuiltInFunctionDefinitions#REGEXP_SPLIT}.
+ *
+ * <p>Splits a string by a regular expression pattern and returns an array of substrings.
+ *
+ * <p>Examples:
+ *
+ * <pre>{@code
+ * REGEXP_SPLIT('Hello123World456', '[0-9]+') = ['Hello', 'World', '']
+ * REGEXP_SPLIT('a,b;c', '[,;]') = ['a', 'b', 'c']
+ * REGEXP_SPLIT('one  two   three', '\\s+') = ['one', 'two', 'three']
+ * }</pre>
+ */
+@Internal
+public class RegexpSplitFunction extends BuiltInScalarFunction {
+
+    public RegexpSplitFunction(SpecializedFunction.SpecializedContext context) {
+        super(BuiltInFunctionDefinitions.REGEXP_SPLIT, context);
+    }
+
+    public @Nullable ArrayData eval(@Nullable StringData str, @Nullable StringData regex) {
+        if (str == null || regex == null) {
+            return null;
+        }
+
+        String regexStr = regex.toString();
+        if (regexStr.isEmpty()) {
+            // If regex is empty, split by each character
+            String strValue = str.toString();
+            StringData[] result = new StringData[strValue.length()];
+            for (int i = 0; i < strValue.length(); i++) {
+                result[i] = StringData.fromString(String.valueOf(strValue.charAt(i)));
+            }
+            return new GenericArrayData(result);
+        }
+
+        Pattern pattern = getRegexpPattern(regexStr);

Review Comment:
   Thanks for the review!
   
   I added getRegexpPattern() rather than reusing getRegexpMatcher() because REGEXP_SPLIT needs to call Pattern.split(str, -1), and split() is a method of the Pattern class, not the Matcher class.
   
   The existing getRegexpMatcher() returns a Matcher, which is designed for match-oriented operations such as find() and group(). That works well for the other REGEXP_* functions (REGEXP_SUBSTR, REGEXP_COUNT, REGEXP_INSTR), which iterate through matches.
   
   REGEXP_SPLIT, however, does not iterate through matches. It splits the input string by the pattern, which requires direct access to the Pattern object.
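   
   To illustrate why the -1 limit matters, here is a minimal standalone sketch using only java.util.regex. The class and method names (SplitLimitDemo, splitKeepingTrailing, splitDroppingTrailing) are hypothetical and are not part of this PR:
   
   ```java
   import java.util.Arrays;
   import java.util.regex.Pattern;

   // Hypothetical demo class, not part of Flink.
   final class SplitLimitDemo {

       // A negative limit keeps trailing empty strings in the result.
       static String[] splitKeepingTrailing(String input, String regex) {
           return Pattern.compile(regex).split(input, -1);
       }

       // The single-argument overload (limit 0) drops trailing empty strings.
       static String[] splitDroppingTrailing(String input, String regex) {
           return Pattern.compile(regex).split(input);
       }

       public static void main(String[] args) {
           // Prints [Hello, World, ] (note the trailing empty element)
           System.out.println(
                   Arrays.toString(splitKeepingTrailing("Hello123World456", "[0-9]+")));
           // Prints [Hello, World]
           System.out.println(
                   Arrays.toString(splitDroppingTrailing("Hello123World456", "[0-9]+")));
       }
   }
   ```
   
   The first call matches the Javadoc example above, where the input ends with a delimiter match and the result ends with an empty string.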
   
   That said, if you prefer, I could inline the cache access directly in RegexpSplitFunction to avoid adding a new utility method:
   ```
   Pattern pattern;
   try {
       pattern = SqlFunctionUtils.REGEXP_PATTERN_CACHE.get(regexStr);
   } catch (PatternSyntaxException e) {
       return null;
   }
   ```
   Please let me know which approach you'd prefer:
   1. Keep getRegexpPattern() as a reusable utility (current approach); it could be useful for future functions that need direct Pattern access
   2. Inline the cache access directly in RegexpSplitFunction
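   
   For comparison, option 1 could look roughly like the sketch below. This is only an illustration of the idea: the real cache in SqlFunctionUtils is not shown in this thread, so a plain ConcurrentHashMap stands in for it here, and the names PatternCacheSketch, getPatternOrNull and splitOrNull are hypothetical:
   
   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.regex.Pattern;
   import java.util.regex.PatternSyntaxException;

   // Hypothetical stand-in for a shared utility; not the actual Flink code.
   final class PatternCacheSketch {

       // Assumption: an unbounded map is enough for a sketch.
       // A production cache would normally bound its size.
       private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

       // Returns the compiled Pattern, or null if the regex is null or invalid.
       static Pattern getPatternOrNull(String regex) {
           if (regex == null) {
               return null;
           }
           try {
               return CACHE.computeIfAbsent(regex, Pattern::compile);
           } catch (PatternSyntaxException e) {
               // Invalid regex: signal with null, as in the inlined variant above.
               return null;
           }
       }

       // A caller that needs Pattern.split(...) rather than a Matcher.
       static String[] splitOrNull(String input, String regex) {
           Pattern pattern = getPatternOrNull(regex);
           if (input == null || pattern == null) {
               return null;
           }
           return pattern.split(input, -1);
       }
   }
   ```
   
   With a helper like this, the function body only has to handle the null return, and any future function that needs a Pattern could reuse the same lookup.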



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
