Re: [PR] [CALCITE-5979] Add REGEXP_REPLACE function (enabled in BigQuery library) [calcite]

via GitHub Tue, 10 Oct 2023 11:15:47 -0700


tanclary commented on code in PR #3460:
URL: https://github.com/apache/calcite/pull/3460#discussion_r1353072272



##########
babel/src/test/resources/sql/big-query.iq:
##########
@@ -707,7 +707,7 @@ SELECT SAFE_DIVIDE(CAST(1.7e308 as DOUBLE), CAST(1.7e-308 
as DOUBLE)) as double_
 
 !ok
 
-SELECT SAFE_DIVIDE(CAST(-3.5e75 AS DECIMAL(76, 0)), CAST(3.5e-75 AS 
DECIMAL(76, 0))) 

Review Comment:
   Oops thanks for fixing



##########
core/src/main/java/org/apache/calcite/runtime/SqlFunctions.java:
##########
@@ -431,7 +433,7 @@ private static boolean checkPosOccurrenceParamValues(int 
position,
         throw 
RESOURCE.invalidIntegerInputForRegexpFunctions(Integer.toString(occurrencePosition),
             "occurrence_position", methodName).ex();
       }
-      if (position <= value.length()) {
+      if (position > value.length()) {

Review Comment:
   Was this wrong before? Or why the change?



##########
core/src/main/java/org/apache/calcite/sql/validate/SqlConformanceEnum.java:
##########
@@ -240,6 +240,15 @@ public enum SqlConformanceEnum implements SqlConformance {
     }
   }
 
+  @Override public boolean isRegexReplaceCaptureGroupDollarIndexed() {

Review Comment:
   I know I suggested this but reading it now I'm wondering whether this 
belongs in 
https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/sql2rel/StandardConvertletTable.java#L1288
 especially since it is so function-specific. Could be a good question for 
@julianhyde 



##########
core/src/test/java/org/apache/calcite/test/SqlFunctionsTest.java:
##########
@@ -482,6 +482,35 @@ static <E> List<E> list() {
     }
   }
 
+  @Test void testRegexpReplaceNonDollarIndexed() {
+    final SqlFunctions.RegexFunction f = new SqlFunctions.RegexFunction();
+    assertThat(f.regexpReplaceNonDollarIndexed("abascusB", "b", "X"), 
is("aXascusB"));
+    assertThat(f.regexpReplaceNonDollarIndexed("abc01def02ghi", "[a-z]+", 
"X"), is("X01X02X"));
+    assertThat(f.regexpReplaceNonDollarIndexed("a0b1c2d3", "0|2", "X"), 
is("aXb1cXd3"));
+

Review Comment:
   remove blank line?



##########
core/src/main/java/org/apache/calcite/runtime/SqlFunctions.java:
##########
@@ -603,6 +605,38 @@ public String regexpReplace(String s, String regex, String 
replacement,
       return Unsafe.regexpReplace(s, pattern, replacement, pos, occurrence);
     }
 
+    /** SQL {@code REGEXP_REPLACE} function with 3 arguments with
+     * {@code \\} based indexing for capturing groups.
+     */
+    public String regexpReplaceNonDollarIndexed(String s, String regex,
+        String replacement) {
+      // Preprocessing to convert double-backslash based indexing for capturing
+      // groups into $ based indices recognized by java regex.

Review Comment:
   nit: "$-based" and "double-backslash-based"



##########
core/src/main/java/org/apache/calcite/adapter/enumerable/RexImpTable.java:
##########
@@ -2428,6 +2425,33 @@ private static class ContainsSubstrImplementor extends 
AbstractRexCallImplemento
     }
   }
 
+  /** Implementor for the {@code REGEXP_REPLACE} function. */
+  private static class RegexpReplaceImplementor extends 
AbstractRexCallImplementor {
+    RegexpReplaceImplementor() {
+      super("regexp_replace", NullPolicy.STRICT, false);
+    }
+
+    @Override Expression implementSafe(final RexToLixTranslator translator,
+        final RexCall call, final List<Expression> argValueList) {
+
+      final boolean dollarIndexed =

Review Comment:
   maybe a small comment here? When I read `dollarIndexed` I'm not sure what 
that means without outside context.



##########
core/src/main/java/org/apache/calcite/adapter/enumerable/RexImpTable.java:
##########
@@ -2428,6 +2425,33 @@ private static class ContainsSubstrImplementor extends 
AbstractRexCallImplemento
     }
   }
 
+  /** Implementor for the {@code REGEXP_REPLACE} function. */
+  private static class RegexpReplaceImplementor extends 
AbstractRexCallImplementor {
+    RegexpReplaceImplementor() {
+      super("regexp_replace", NullPolicy.STRICT, false);
+    }
+
+    @Override Expression implementSafe(final RexToLixTranslator translator,
+        final RexCall call, final List<Expression> argValueList) {
+

Review Comment:
   can we remove this newline



##########
core/src/main/java/org/apache/calcite/runtime/SqlFunctions.java:
##########
@@ -603,6 +605,38 @@ public String regexpReplace(String s, String regex, String 
replacement,
       return Unsafe.regexpReplace(s, pattern, replacement, pos, occurrence);
     }
 
+    /** SQL {@code REGEXP_REPLACE} function with 3 arguments with
+     * {@code \\} based indexing for capturing groups.
+     */
+    public String regexpReplaceNonDollarIndexed(String s, String regex,
+        String replacement) {
+      // Preprocessing to convert double-backslash based indexing for capturing
+      // groups into $ based indices recognized by java regex.
+
+      // Explicitly escaping any $ symbols coming from input
+      // to ignore them from being considered as capturing group index.
+      String indexedReplacement = replacement.replace("\\\\", "\\")
+          .replace("$", "\\$");
+
+      // Check each occurrence of escaped chars, convert \<n> integers into
+      // $<n> indices, keep \\ and \$, throw an error for any other invalid 
escapes.
+      int lastOccIdx = 0;
+      while (lastOccIdx != -1) {

Review Comment:
   This is a bit nested is there a way to flatten it some?



##########
core/src/main/java/org/apache/calcite/sql/fun/SqlLibraryOperators.java:
##########
@@ -515,7 +515,9 @@ static RelDataType deriveTypeSplit(SqlOperatorBinding 
operatorBinding,
           
OperandTypes.STRING_STRING_OPTIONAL_INTEGER_OPTIONAL_INTEGER_OPTIONAL_INTEGER,
           SqlFunctionCategory.STRING);
 
-  @LibraryOperator(libraries = {MYSQL, ORACLE})
+  /** The "REGEXP_REPLACE(value, regexp, rep [, pos [, occurrence [, 
matchType]]])" function.
+   * Replaces all substrings of value that match regexp with rep and returns 
modified value. */
+  @LibraryOperator(libraries = {MYSQL, ORACLE, BIG_QUERY})

Review Comment:
   Make this alphabetical if you don't mind



##########
core/src/main/java/org/apache/calcite/runtime/SqlFunctions.java:
##########
@@ -546,7 +548,7 @@ public int regexpInstr(String value, String regex, int 
position,
       final String methodName = "REGEXP_INSTR";
       final Pattern pattern = validateRegexPattern(regex, methodName);
 
-      if (checkPosOccurrenceParamValues(position, occurrence, 
occurrencePosition, value,
+      if (!validatePosOccurrenceParamValues(position, occurrence, 
occurrencePosition, value,
           methodName) || regex.isEmpty()) {

Review Comment:
   Is it more optimal to check `regex.isEmpty()` first? Like if that is a 
cheaper operation maybe we should do it first? What do you think?



##########
site/_docs/reference.md:
##########
@@ -2791,7 +2791,7 @@ BigQuery's type system uses confusingly different names 
for types and functions:
 | b | REGEXP_EXTRACT(string, regexp [, position [, occurrence]]) | Returns the 
substring in *string* that matches the *regexp*, starting search at *position* 
(default 1), and until locating the nth *occurrence* (default 1). Returns NULL 
if there is no match
 | b | REGEXP_EXTRACT_ALL(string, regexp)             | Returns an array of all 
substrings in *string* that matches the *regexp*. Returns an empty array if 
there is no match
 | b | REGEXP_INSTR(string, regexp [, position [, occurrence [, 
occurrence_position]]]) | Returns the lowest 1-based position of the substring 
in *string* that matches the *regexp*, starting search at *position* (default 
1), and until locating the nth *occurrence* (default 1). Setting 
occurrence_position (default 0) to 1 returns the end position of substring + 1. 
Returns 0 if there is no match
-| m o | REGEXP_REPLACE(string, regexp, rep [, pos [, occurrence [, 
matchType]]]) | Replaces all substrings of *string* that match *regexp* with 
*rep* at the starting *pos* in expr (if omitted, the default is 1), 
*occurrence* means which occurrence of a match to search for (if omitted, the 
default is 1), *matchType* specifies how to perform matching
+| m o b | REGEXP_REPLACE(string, regexp, rep [, pos [, occurrence [, 
matchType]]]) | Replaces all substrings of *string* that match *regexp* with 
*rep* at the starting *pos* in expr (if omitted, the default is 1), 
*occurrence* means which occurrence of a match to search for (if omitted, the 
default is 1), *matchType* specifies how to perform matching

Review Comment:
   nit: change "occurrence means which ..." to "occurrence specifies which ..." 
to be consistent with the rest of the description



##########
site/_docs/reference.md:
##########
@@ -2791,7 +2791,7 @@ BigQuery's type system uses confusingly different names 
for types and functions:
 | b | REGEXP_EXTRACT(string, regexp [, position [, occurrence]]) | Returns the 
substring in *string* that matches the *regexp*, starting search at *position* 
(default 1), and until locating the nth *occurrence* (default 1). Returns NULL 
if there is no match
 | b | REGEXP_EXTRACT_ALL(string, regexp)             | Returns an array of all 
substrings in *string* that matches the *regexp*. Returns an empty array if 
there is no match
 | b | REGEXP_INSTR(string, regexp [, position [, occurrence [, 
occurrence_position]]]) | Returns the lowest 1-based position of the substring 
in *string* that matches the *regexp*, starting search at *position* (default 
1), and until locating the nth *occurrence* (default 1). Setting 
occurrence_position (default 0) to 1 returns the end position of substring + 1. 
Returns 0 if there is no match
-| m o | REGEXP_REPLACE(string, regexp, rep [, pos [, occurrence [, 
matchType]]]) | Replaces all substrings of *string* that match *regexp* with 
*rep* at the starting *pos* in expr (if omitted, the default is 1), 
*occurrence* means which occurrence of a match to search for (if omitted, the 
default is 1), *matchType* specifies how to perform matching
+| m o b | REGEXP_REPLACE(string, regexp, rep [, pos [, occurrence [, 
matchType]]]) | Replaces all substrings of *string* that match *regexp* with 
*rep* at the starting *pos* in expr (if omitted, the default is 1), 
*occurrence* means which occurrence of a match to search for (if omitted, the 
default is 1), *matchType* specifies how to perform matching

Review Comment:
   Make alphabetical as well



##########
testkit/src/main/java/org/apache/calcite/test/SqlOperatorTest.java:
##########
@@ -4810,12 +4810,34 @@ private static void checkIf(SqlOperatorFixture f) {
           "VARCHAR NOT NULL");
       f.checkString("regexp_replace('abc\t\ndef\t\nghi', '\\w+', '+')", 
"+\t\n+\t\n+",
           "VARCHAR NOT NULL");
+
       f.checkQuery("select regexp_replace('a b c', 'b', 'X')");
       f.checkQuery("select regexp_replace('a b c', 'b', 'X', 1)");
       f.checkQuery("select regexp_replace('a b c', 'b', 'X', 1, 3)");
       f.checkQuery("select regexp_replace('a b c', 'b', 'X', 1, 3, 'i')");
     };
-    f0.forEachLibrary(list(SqlLibrary.MYSQL, SqlLibrary.ORACLE), consumer);
+    f0.forEachLibrary(list(SqlLibrary.MYSQL, SqlLibrary.ORACLE, 
SqlLibrary.BIG_QUERY), consumer);
+

Review Comment:
   Do any of these tests check what happens if you use the wrong syntax for 
either dialect? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [CALCITE-5979] Add REGEXP_REPLACE function (enabled in BigQuery library) [calcite]

Reply via email to