[ 
https://issues.apache.org/jira/browse/CALCITE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752550#comment-17752550
 ] 

Jerin John commented on CALCITE-5910:
-------------------------------------

[~julianhyde] Posting the comment from the other issue here for reference. 
Here is the example we encountered when testing that implementation:

The REGEXP_CONTAINS function in BQ is expected to return an error if the regexp 
argument is invalid. To mimic this behavior we went with the 
Pattern.compile() method from the native java.util.regex library, which compiles 
the expression into a Pattern object and throws a PatternSyntaxException for 
invalid patterns.
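
A minimal sketch of that approach, for illustration only. The class name, the method name {{regexpContains}}, and the wording of the error message are hypothetical and are not taken from the Calcite code base:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexpContainsSketch {
  /**
   * Hypothetical helper: returns whether value contains a match for regex,
   * and raises an error when java.util.regex rejects the pattern.
   */
  public static boolean regexpContains(String value, String regex) {
    final Pattern pattern;
    try {
      // Pattern.compile throws PatternSyntaxException on a malformed pattern.
      pattern = Pattern.compile(regex);
    } catch (PatternSyntaxException e) {
      // The message format is an assumption made for this sketch.
      throw new IllegalArgumentException(
          "Invalid regular expression for REGEXP_CONTAINS: '"
              + e.getDescription() + "'", e);
    }
    // find() looks for a partial match anywhere in value.
    return pattern.matcher(value).find();
  }

  public static void main(String[] args) {
    System.out.println(regexpContains("abc def ghi", "d.f"));
    try {
      regexpContains("abc def ghi", "(abc"); // unclosed group
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

This only reports what java.util.regex itself considers invalid, so any pattern that Java tolerates passes through unchecked.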

BigQuery/GoogleSQL uses the RE2 library to evaluate regular expressions (as 
mentioned in the [BQ 
docs|https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_contains])
 and detects a few additional invalid cases that the Java regex 
library accepts.
E.g.:

{{SELECT REGEXP_CONTAINS('abc def ghi', '{3}');}}

{{BQ Error: Cannot parse regular expression: no argument for repetition 
operator: {3}}}

 

{{SELECT REGEXP_CONTAINS('abc def ghi', '\d');}}

{{BQ Error: Syntax error: Illegal escape sequence: \d at [1:40]}}

 

The Java regex library accepts both of the above examples and returns a 
boolean result instead of the errors expected from BQ. We need to decide 
whether to handle these conditions explicitly or to import the re2j 
library for Java to do the parsing.
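
A small probe along these lines can be used to check which patterns java.util.regex accepts. It is a sketch only: the class and method names are hypothetical, and the pattern strings in {{main}} are one reading of the two examples above as they would reach the regex engine, which is an assumption.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class JavaRegexProbe {
  /** Returns true if java.util.regex compiles the pattern without error. */
  public static boolean compiles(String regex) {
    try {
      Pattern.compile(regex);
      return true;
    } catch (PatternSyntaxException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // "{3}" and "\\d" are assumed readings of the two BQ examples;
    // "(abc" is an extra control case that Java is known to reject.
    String[] patterns = {"{3}", "\\d", "(abc"};
    for (String p : patterns) {
      System.out.println(p + " -> " + (compiles(p) ? "accepted" : "rejected"));
    }
  }
}
```

Running the same list of patterns through re2j would show where the two engines disagree. That comparison is left out here because it needs the extra dependency.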

> Add REGEXP_EXTRACT and REGEXP_SUBSTR functions (enabled in BigQuery library)
> ----------------------------------------------------------------------------
>
>                 Key: CALCITE-5910
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5910
>             Project: Calcite
>          Issue Type: Task
>            Reporter: Jerin John
>            Assignee: Jerin John
>            Priority: Major
>              Labels: pull-request-available
>
> Add support for 
> [REGEXP_EXTRACT|https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_extract]
>  and 
> [REGEXP_SUBSTR|https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_substr]
>  functions from BigQuery.
> *{{REGEXP_EXTRACT(value, regexp[, position[, occurrence]])}}*
> Returns the substring in {{value}} that matches the regular expression 
> {{regexp}}. Returns {{NULL}} if there is no match.
>  * If the regular expression contains a capturing group ({{(...)}}), and 
> there is a match for that capturing group, that match is returned. If there 
> are multiple matches for a capturing group, the last match is returned.
>  * If {{position}} is specified, the search starts at this position in 
> {{value}}, otherwise it starts at the beginning of {{value}}.
>  * If {{occurrence}} is specified, the search returns a specific occurrence 
> of the {{regexp}} in {{value}}, otherwise returns the first match.
>  
> *{{REGEXP_SUBSTR(value, regexp[, position[, occurrence]])}}*
> Synonym for {{REGEXP_EXTRACT}}.
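
The semantics quoted above can be sketched with java.util.regex as follows. This is an illustration under stated assumptions, not the implementation proposed for this issue: the class and method names are hypothetical, {{position}} and {{occurrence}} are taken to be 1-based, occurrences are counted as non-overlapping matches, and a Java {{null}} stands in for SQL {{NULL}}.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexpExtractSketch {
  /** Hypothetical helper mirroring REGEXP_EXTRACT(value, regexp, position, occurrence). */
  public static String regexpExtract(
      String value, String regex, int position, int occurrence) {
    if (position < 1 || occurrence < 1) {
      // Assumption: both arguments are 1-based and must be positive.
      throw new IllegalArgumentException("position and occurrence must be >= 1");
    }
    int start = position - 1;
    if (start > value.length()) {
      return null; // nothing left to search
    }
    Matcher m = Pattern.compile(regex).matcher(value);
    // find(int) resets the matcher and searches from the given index;
    // each later find() continues after the previous match.
    boolean found = m.find(start);
    for (int i = 1; found && i < occurrence; i++) {
      found = m.find();
    }
    if (!found) {
      return null;
    }
    // With a capturing group, return what the group matched (for a repeated
    // group Java keeps the last capture); otherwise return the whole match.
    return m.groupCount() > 0 ? m.group(1) : m.group();
  }

  public static void main(String[] args) {
    String s = "Hello Helloo and Hellooo";
    System.out.println(regexpExtract(s, "H?ello+", 1, 1)); // Hello
    System.out.println(regexpExtract(s, "H?ello+", 3, 2)); // Hellooo
    System.out.println(regexpExtract(s, "H?ello+", 3, 3)); // null
    System.out.println(regexpExtract("abc def ghi", "d(e)f", 1, 1)); // e
  }
}
```

Under these assumptions {{REGEXP_SUBSTR}} would simply delegate to the same helper.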



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
