[
https://issues.apache.org/jira/browse/CALCITE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752187#comment-17752187
]
Mihai Budiu commented on CALCITE-5910:
--------------------------------------
First, let me make the disclaimer that I am still relatively new to Calcite, so
what I say below may be wrong. There may be some additional complications that
I do not even know about.
I have contemplated the problem of regular expressions myself and I have not
reached a happy conclusion. The difficulty is that there are *many* languages
for describing regular expressions. They tend to have a large common area, but
then they differ on corner cases, as the other comments points out. SQL has
LIKE, but also the more complex SIMILAR. On top of that databases add their own
regex dialects, for example Postgres has Posix regular expressions:
https://www.postgresql.org/docs/15/functions-matching.html
Calcite already has RLIKE, and a bunch of REGEX functions for various dialects.
RLIKE is *defined* to use the Java regex library, and that's very convenient
for Calcite.
As I see it, these functions can be evaluated in two places:
- at compilation, using the PROJECT_REDUCE_EXPRESSIONS optimizer rule, when the
plan is optimized
- at runtime, using whatever runtime you are using
These two evaluation methods better give the same results for constant
expressions, otherwise users may have very unpleasant surprises.
If you need 100% compatibility with a certain dialect of regexes, I think you
have two choices:
- use a Java library that faithfully implements the regex library
- prevent the compiler from ever evaluating such expressions at compile-time.
There are least two ways you can do that: (1) treat the function like a
user-defined function, or (2) claim it is non-deterministic
If you will always execute the Calcite-optimized code on your backend (e.g.,
BigQuery), maybe you can get away without implementing the functions in Java.
If you want to use Calcite as a runtime I believe that you have only one choice
left, and you better pray that there is a well-tested Java library implementing
your functions.
> Add REGEXP_EXTRACT function (enabled in BigQuery library)
> ---------------------------------------------------------
>
> Key: CALCITE-5910
> URL: https://issues.apache.org/jira/browse/CALCITE-5910
> Project: Calcite
> Issue Type: Task
> Reporter: Jerin John
> Assignee: Jerin John
> Priority: Major
> Labels: pull-request-available
>
> Add support for
> [REGEXP_EXTRACT|https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_extract]
> function from BigQuery.
> *{{REGEXP_EXTRACT(value, regexp[, position[, occurrence]])}}*
> Returns the substring in {{value}} that matches the regular expression
> {{{}regexp{}}}. Returns {{NULL}} if there is no match.
> * If the regular expression contains a capturing group ({{{}(...){}}}), and
> there is a match for that capturing group, that match is returned. If there
> are multiple matches for a capturing group, the last match is returned.
> * If {{position}} is specified, the search starts at this position in
> {{{}value{}}}, otherwise it starts at the beginning of {{{}value{}}}.
> * If {{occurrence}} is specified, the search returns a specific occurrence
> of the {{regexp}} in {{{}value{}}}, otherwise returns the first match.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)