This is an automated email from the ASF dual-hosted git repository.
maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 3fde0ba6e67 [SPARK-39744][SQL] Add the `REGEXP_INSTR` function
3fde0ba6e67 is described below
commit 3fde0ba6e67ca45e25e8a19e9bc8f8371f12cb71
Author: Max Gekk <[email protected]>
AuthorDate: Tue Jul 12 16:19:50 2022 +0300
[SPARK-39744][SQL] Add the `REGEXP_INSTR` function
### What changes were proposed in this pull request?
In this PR, I propose to add a new expression, `RegExpInStr`, and bind it to the
function name `REGEXP_INSTR`. The `REGEXP_INSTR` function searches a string for
a regular expression and returns an integer that indicates the beginning
position of the matched substring. Positions are 1-based, not 0-based. If no
match is found, it returns 0. It takes two parameters:
1. An expression that specifies the string in which the search is to take
place.
2. An expression that specifies the regular expression string that is the
pattern for the search.
If the regular expression is not found, the result is **0** (this behaviour
is similar to other DBMSs). If any input parameter is NULL, the function
returns NULL as well.
For example:
```sql
spark-sql> CREATE TABLE log (logs string);
spark-sql> INSERT INTO log (logs) VALUES
> ('127.0.0.1 - - [10/Jan/2022:16:55:36 -0800] "GET / HTTP/1.0" 200 2217'),
> ('192.168.1.99 - - [14/Feb/2022:10:27:10 -0800] "GET /cgi-bin/try/ HTTP/1.0" 200 3396');
spark-sql> SELECT REGEXP_INSTR(logs, '\\b\\d{1,3}\.\\d{1,3}\.\\d{1,3}\.\\d{1,3}\\b') AS IP,
> REGEXP_INSTR(logs, '([\\w:\/]+\\s[+\-]\\d{4})') AS DATE FROM log;
1 19
1 16
```
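The 1-based position semantics above can be sketched with plain `java.util.regex`, on which the new expression is built. This is a minimal illustration of the contract (first match start plus one, 0 on no match), not Spark code; the helper name and the sample strings are made up for the example:
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexpInstrSketch {
    // Hypothetical helper mirroring REGEXP_INSTR semantics:
    // 1-based start of the first match, 0 when nothing matches.
    static int regexpInstr(String subject, String regex) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.start() + 1 : 0;
    }

    public static void main(String[] args) {
        // '@' sits at 0-based index 4, so the 1-based answer is 5.
        System.out.println(regexpInstr("[email protected]", "@[^.]*")); // 5
        // No match: the function returns 0, not -1 and not NULL.
        System.out.println(regexpInstr("abc", "x")); // 0
    }
}
```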
### Why are the changes needed?
To ease migration from other systems to Spark SQL and to achieve feature
parity with them. For example, the systems below support the `REGEXP_INSTR`
function, see:
- MariaDB: https://mariadb.com/kb/en/regexp_instr/
- Oracle: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions129.htm
- DB2: https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-instr
- Snowflake: https://docs.snowflake.com/en/sql-reference/functions/regexp_instr.html
- BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_instr
- Redshift: https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html
- Exasol DB: https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_instr.htm
- Vertica: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_INSTR.htm
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By running new tests:
```
$ build/sbt "sql/testOnly *ExpressionsSchemaSuite"
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
$ build/sbt "test:testOnly *.RegexpExpressionsSuite"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z regexp-functions.sql"
```
Closes #37154 from MaxGekk/regexp_instr.
Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
---
.../sql/catalyst/analysis/FunctionRegistry.scala | 1 +
.../catalyst/expressions/regexpExpressions.scala | 85 ++++++++++++++++++++++
.../expressions/RegexpExpressionsSuite.scala | 14 ++++
.../sql-functions/sql-expression-schema.md | 1 +
.../sql-tests/inputs/regexp-functions.sql | 8 ++
.../sql-tests/results/regexp-functions.sql.out | 48 ++++++++++++
6 files changed, 157 insertions(+)
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
index 20c719aec68..088d85034db 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -586,6 +586,7 @@ object FunctionRegistry {
expression[XPathString]("xpath_string"),
expression[RegExpCount]("regexp_count"),
expression[RegExpSubStr]("regexp_substr"),
+ expression[RegExpInStr]("regexp_instr"),
// datetime functions
expression[AddMonths]("add_months"),
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
index b240e849f4d..2e30558b6c0 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
@@ -1043,3 +1043,88 @@ case class RegExpSubStr(left: Expression, right: Expression)
newChildren: IndexedSeq[Expression]): RegExpSubStr =
copy(left = newChildren(0), right = newChildren(1))
}
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+ usage = """
+ _FUNC_(str, regexp) - Searches a string for a regular expression and returns an integer that indicates the beginning position of the matched substring. Positions are 1-based, not 0-based. If no match is found, returns 0.
+ """,
+ arguments = """
+ Arguments:
+ * str - a string expression.
+ * regexp - a string representing a regular expression. The regex string should be a
+ Java regular expression.<br><br>
+ Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL
+ parser. For example, to match "\abc", a regular expression for `regexp` can be
+ "^\\abc$".<br><br>
+ There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+ fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+ if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+ """,
+ examples = """
+ Examples:
+ > SELECT _FUNC_('[email protected]', '@[^.]*');
+ 5
+ """,
+ since = "3.4.0",
+ group = "string_funcs")
+// scalastyle:on line.size.limit
+case class RegExpInStr(subject: Expression, regexp: Expression, idx: Expression)
+ extends RegExpExtractBase {
+ def this(s: Expression, r: Expression) = this(s, r, Literal(0))
+
+ override def nullSafeEval(s: Any, r: Any, i: Any): Any = {
+ try {
+ val m = getLastMatcher(s, r)
+ if (m.find) {
+ m.toMatchResult.start() + 1
+ } else {
+ 0
+ }
+ } catch {
+ case _: IllegalStateException => 0
+ }
+ }
+
+ override def dataType: DataType = IntegerType
+ override def prettyName: String = "regexp_instr"
+
+ override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+ val classNamePattern = classOf[Pattern].getCanonicalName
+ val matcher = ctx.freshName("matcher")
+
+ val termLastRegex = ctx.addMutableState("UTF8String", "lastRegex")
+ val termPattern = ctx.addMutableState(classNamePattern, "pattern")
+
+ val setEvNotNull = if (nullable) {
+ s"${ev.isNull} = false;"
+ } else {
+ ""
+ }
+
+ nullSafeCodeGen(ctx, ev, (subject, regexp, _) => {
+ s"""
+ |try {
+ | $setEvNotNull
+ | if (!$regexp.equals($termLastRegex)) {
+ | // regex value changed
+ | $termLastRegex = $regexp.clone();
+ | $termPattern = $classNamePattern.compile($termLastRegex.toString());
+ | }
+ | java.util.regex.Matcher $matcher = $termPattern.matcher($subject.toString());
+ | if ($matcher.find()) {
+ | ${ev.value} = $matcher.toMatchResult().start() + 1;
+ | } else {
+ | ${ev.value} = 0;
+ | }
+ |} catch (IllegalStateException e) {
+ | ${ev.value} = 0;
+ |}
+ |""".stripMargin
+ })
+ }
+
+ override protected def withNewChildrenInternal(
+ newFirst: Expression, newSecond: Expression, newThird: Expression): RegExpInStr =
+ copy(subject = newFirst, regexp = newSecond, idx = newThird)
+}
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RegexpExpressionsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RegexpExpressionsSuite.scala
index 5b007b87915..d0c03d1ccaf 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RegexpExpressionsSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RegexpExpressionsSuite.scala
@@ -483,4 +483,18 @@ class RegexpExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
.likeAll("%foo%", Literal.create(null, StringType)), null)
}
}
+
+ test("RegExpInStr") {
+ val expr = new RegExpInStr($"s".string.at(0), $"p".string.at(1))
+ checkEvaluation(expr, 1, create_row("100-200", "(\\d+)-(\\d+)"))
+ checkEvaluation(expr, 1, create_row("100-200", "(\\d+).*"))
+ // will not match anything; the result is 0
+ checkEvaluation(expr, 0, create_row("100-200", "([a-z])"))
+ checkEvaluation(expr, null, create_row(null, "([a-z])"))
+ checkEvaluation(expr, null, create_row("100-200", null))
+
+ // Test escaping of arguments
+ GenerateUnsafeProjection.generate(
+ new RegExpInStr(Literal("\"quote"), Literal("\"quote")) :: Nil)
+ }
}
diff --git a/sql/core/src/test/resources/sql-functions/sql-expression-schema.md b/sql/core/src/test/resources/sql-functions/sql-expression-schema.md
index 0305781a48d..907c62b4ee0 100644
--- a/sql/core/src/test/resources/sql-functions/sql-expression-schema.md
+++ b/sql/core/src/test/resources/sql-functions/sql-expression-schema.md
@@ -240,6 +240,7 @@
| org.apache.spark.sql.catalyst.expressions.RegExpCount | regexp_count | SELECT regexp_count('Steven Jones and Stephen Smith are the best players', 'Ste(v|ph)en') | struct<regexp_count(Steven Jones and Stephen Smith are the best players, Ste(v|ph)en):int> |
| org.apache.spark.sql.catalyst.expressions.RegExpExtract | regexp_extract | SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 1) | struct<regexp_extract(100-200, (\d+)-(\d+), 1):string> |
| org.apache.spark.sql.catalyst.expressions.RegExpExtractAll | regexp_extract_all | SELECT regexp_extract_all('100-200, 300-400', '(\\d+)-(\\d+)', 1) | struct<regexp_extract_all(100-200, 300-400, (\d+)-(\d+), 1):array<string>> |
+| org.apache.spark.sql.catalyst.expressions.RegExpInStr | regexp_instr | SELECT regexp_instr('[email protected]', '@[^.]*') | struct<regexp_instr([email protected], @[^.]*, 0):int> |
| org.apache.spark.sql.catalyst.expressions.RegExpReplace | regexp_replace | SELECT regexp_replace('100-200', '(\\d+)', 'num') | struct<regexp_replace(100-200, (\d+), num, 1):string> |
| org.apache.spark.sql.catalyst.expressions.RegExpSubStr | regexp_substr | SELECT regexp_substr('Steven Jones and Stephen Smith are the best players', 'Ste(v|ph)en') | struct<regexp_substr(Steven Jones and Stephen Smith are the best players, Ste(v|ph)en):string> |
| org.apache.spark.sql.catalyst.expressions.Remainder | % | SELECT 2 % 1.8 | struct<(2 % 1.8):decimal(2,1)> |
diff --git a/sql/core/src/test/resources/sql-tests/inputs/regexp-functions.sql b/sql/core/src/test/resources/sql-tests/inputs/regexp-functions.sql
index d828d761b77..62f69660158 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/regexp-functions.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/regexp-functions.sql
@@ -69,3 +69,11 @@ SELECT regexp_substr('1a 2b 14m', '\\d{2}(a|b|m)');
SELECT regexp_substr('1a 2b 14m', '');
SELECT regexp_substr('Spark', null);
SELECT regexp_substr(null, '.*');
+
+-- regexp_instr
+SELECT regexp_instr('abc', 'b');
+SELECT regexp_instr('abc', 'x');
+SELECT regexp_instr('ABC', '(?-i)b');
+SELECT regexp_instr('1a 2b 14m', '\\d{2}(a|b|m)');
+SELECT regexp_instr('abc', null);
+SELECT regexp_instr(null, 'b');
diff --git a/sql/core/src/test/resources/sql-tests/results/regexp-functions.sql.out b/sql/core/src/test/resources/sql-tests/results/regexp-functions.sql.out
index c82b892f481..4b72c5711bd 100644
--- a/sql/core/src/test/resources/sql-tests/results/regexp-functions.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/regexp-functions.sql.out
@@ -491,3 +491,51 @@ SELECT regexp_substr(null, '.*')
struct<regexp_substr(NULL, .*):string>
-- !query output
NULL
+
+
+-- !query
+SELECT regexp_instr('abc', 'b')
+-- !query schema
+struct<regexp_instr(abc, b, 0):int>
+-- !query output
+2
+
+
+-- !query
+SELECT regexp_instr('abc', 'x')
+-- !query schema
+struct<regexp_instr(abc, x, 0):int>
+-- !query output
+0
+
+
+-- !query
+SELECT regexp_instr('ABC', '(?-i)b')
+-- !query schema
+struct<regexp_instr(ABC, (?-i)b, 0):int>
+-- !query output
+0
+
+
+-- !query
+SELECT regexp_instr('1a 2b 14m', '\\d{2}(a|b|m)')
+-- !query schema
+struct<regexp_instr(1a 2b 14m, \d{2}(a|b|m), 0):int>
+-- !query output
+7
+
+
+-- !query
+SELECT regexp_instr('abc', null)
+-- !query schema
+struct<regexp_instr(abc, NULL, 0):int>
+-- !query output
+NULL
+
+
+-- !query
+SELECT regexp_instr(null, 'b')
+-- !query schema
+struct<regexp_instr(NULL, b, 0):int>
+-- !query output
+NULL
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]