[ 
https://issues.apache.org/jira/browse/CALCITE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758715#comment-17758715
 ] 

Julian Hyde commented on CALCITE-5914:
--------------------------------------

I saw some good review comments in [PR 
3394|https://github.com/apache/calcite/pull/3394] from [~rubenql] and [~taoran] 
- thank you.

Can I have a couple of reviews? Some potential reviewers:
 * [~tjbanghart], Can you look at the changes I made to {{PARSE_TIMESTAMP}} and 
{{{}FORMAT_DATE{}}}, related to your work on CALCITE-5543;
 * [~Runking], Can you look at the changes to {{{}PARSE_URL{}}}, related to 
CALCITE-5820;
 * [~rubenql], The changes to Posix regex might interest you, since you fixed 
CALCITE-4419.


> Cache compiled regular expressions in SQL function runtime
> ----------------------------------------------------------
>
>                 Key: CALCITE-5914
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5914
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Assignee: Julian Hyde
>            Priority: Major
>              Labels: pull-request-available
>
> Cache compiled regular expressions (and other amortized work) in SQL function 
> runtime. Compiling a regular expression to a pattern is expensive (compared 
> to the cost of matching, given an existing pattern) and therefore caching the 
> compiled form will yield performance benefits if the regular expression is 
> constant or has a small number of values.
> Consider the following query:
> {code:java}
> SELECT ename, job, RLIKE(ename, 'A.*'), RLIKE(ename, job || '.*')
> FROM emp
> {code}
> The first regular expression, '{{A.\*}}', is constant and can be compiled at 
> prepare time or at the start of execution; the second regular expression, 
> {{job || '.\*'}}, might vary from one row to the next. However if the {{job}} 
> column has a small number of values it still might be beneficial to cache the 
> compiled regular expression.
> If {{SqlFunctions.rlike}} could use a cache (mapping from {{String}} to 
> {{{}java.util.regex.Pattern{}}}) then it would achieve benefits in both the 
> constant and non-constant cases.
> The cache needs to:
>  * be thread-safe (in case queries are executing using multiple threads),
>  * return thread-safe objects (as is {{{}Pattern{}}}),
>  * have bounded space (so that a query doesn't blow memory with 1 million 
> distinct regular expressions),
>  * disposed after the query has terminated,
>  * (ideally) share with regexes of the same language in the same query,
>  * not conflict with regexes of different languages in the same query.
> One possible implementation is to add an {{interface FunctionState}}, with 
> subclasses including {{class RegexpCache}}, and if argument 1 of a function 
> is a subclass of {{FunctionState}} the compiler would initialize the state in 
> the generated code. The function can rely on the state argument being 
> initialized, and being the same object from one call to the next. Example:
> {code:java}
> interface FunctionState {
> }
> class RegexpCache implements FunctionState {
>   final Cache<String, Pattern> cache = ...;
> }
> {code}
> This change should install the cache for all applicable functions, including 
> LIKE, ILIKE, RLIKE, SIMILAR, posix regex, REGEXP_REPLACE (MySQL, Oracle), 
> REGEXP_CONTAINS (BigQuery), other BigQuery REGEXP_ functions, PARSE_URL, 
> JSON_REPLACE, PARSE_TIMESTAMP.
> It can also be used for functions that have mutable state, e.g. {{RANDOM}} 
> with and without a seed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to