[ 
https://issues.apache.org/jira/browse/CALCITE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752187#comment-17752187
 ] 

Mihai Budiu commented on CALCITE-5910:
--------------------------------------

First, let me make the disclaimer that I am still relatively new to Calcite, so 
what I say below may be wrong. There may be some additional complications that 
I do not even know about.

I have contemplated the problem of regular expressions myself and I have not 
reached a happy conclusion. The difficulty is that there are *many* languages 
for describing regular expressions. They tend to have a large common area, but 
then they differ on corner cases, as the other comments points out. SQL has 
LIKE, but also the more complex SIMILAR. On top of that databases add their own 
regex dialects, for example Postgres has Posix regular expressions: 
https://www.postgresql.org/docs/15/functions-matching.html

Calcite already has RLIKE, and a bunch of REGEX functions for various dialects. 
RLIKE is *defined* to use the Java regex library, and that's very convenient 
for Calcite.

As I see it, these functions can be evaluated in two places:
- at compilation, using the PROJECT_REDUCE_EXPRESSIONS optimizer rule, when the 
plan is optimized
- at runtime, using whatever runtime you are using

These two evaluation methods better give the same results for constant 
expressions, otherwise users may have very unpleasant surprises.

If you need 100% compatibility with a certain dialect of regexes, I think you 
have two choices:
- use a Java library that faithfully implements the regex library
- prevent the compiler from ever evaluating such expressions at compile-time. 
There are least two ways you can do that: (1) treat the function like a 
user-defined function, or (2) claim it is non-deterministic

If you will always execute the Calcite-optimized code on your backend (e.g., 
BigQuery), maybe you can get away without implementing the functions in Java.

If you want to use Calcite as a runtime I believe that you have only one choice 
left, and you better pray that there is a well-tested Java library implementing 
your functions.

> Add REGEXP_EXTRACT function (enabled in BigQuery library)
> ---------------------------------------------------------
>
>                 Key: CALCITE-5910
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5910
>             Project: Calcite
>          Issue Type: Task
>            Reporter: Jerin John
>            Assignee: Jerin John
>            Priority: Major
>              Labels: pull-request-available
>
> Add support for 
> [REGEXP_EXTRACT|https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_extract]
>  function from BigQuery.
> *{{REGEXP_EXTRACT(value, regexp[, position[, occurrence]])}}*
> Returns the substring in {{value}} that matches the regular expression 
> {{{}regexp{}}}. Returns {{NULL}} if there is no match.
>  * If the regular expression contains a capturing group ({{{}(...){}}}), and 
> there is a match for that capturing group, that match is returned. If there 
> are multiple matches for a capturing group, the last match is returned.
>  * If {{position}} is specified, the search starts at this position in 
> {{{}value{}}}, otherwise it starts at the beginning of {{{}value{}}}.
>  * If {{occurrence}} is specified, the search returns a specific occurrence 
> of the {{regexp}} in {{{}value{}}}, otherwise returns the first match.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to