[
https://issues.apache.org/jira/browse/CALCITE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julian Hyde reassigned CALCITE-5914:
------------------------------------
Assignee: Julian Hyde
> Cache compiled regular expressions in SQL function runtime
> ----------------------------------------------------------
>
> Key: CALCITE-5914
> URL: https://issues.apache.org/jira/browse/CALCITE-5914
> Project: Calcite
> Issue Type: Bug
> Reporter: Julian Hyde
> Assignee: Julian Hyde
> Priority: Major
> Labels: pull-request-available
>
> Cache compiled regular expressions (and other amortized work) in SQL function
> runtime. Compiling a regular expression to a pattern is expensive (compared
> to the cost of matching, given an existing pattern) and therefore caching the
> compiled form will yield performance benefits if the regular expression is
> constant or has a small number of values.
> Consider the following query:
> {code:java}
> SELECT ename, job, RLIKE(ename, 'A.*'), RLIKE(ename, job || '.*')
> FROM emp
> {code}
> The first regular expression, '{{A.\*}}', is constant and can be compiled at
> prepare time or at the start of execution; the second regular expression,
> {{job || '.\*'}}, might vary from one row to the next. However if the {{job}}
> column has a small number of values it still might be beneficial to cache the
> compiled regular expression.
> If {{SqlFunctions.rlike}} could use a cache (mapping from {{String}} to
> {{{}java.util.regex.Pattern{}}}) then it would achieve benefits in both the
> constant and non-constant cases.
> The cache needs to:
> * be thread-safe (in case queries are executing using multiple threads),
> * return thread-safe objects (as is {{{}Pattern{}}}),
> * have bounded space (so that a query doesn't blow memory with 1 million
> distinct regular expressions),
> * disposed after the query has terminated,
> * (ideally) share with regexes of the same language in the same query,
> * not conflict with regexes of different languages in the same query.
> One possible implementation is to add an {{interface FunctionState}}, with
> subclasses including {{class RegexpCache}}, and if argument 1 of a function
> is a subclass of {{FunctionState}} the compiler would initialize the state in
> the generated code. The function can rely on the state argument being
> initialized, and being the same object from one call to the next. Example:
> {code:java}
> interface FunctionState {
> }
> class RegexpCache implements FunctionState {
> final Cache<String, Pattern> cache = ...;
> }
> {code}
> This change should install the cache for all applicable functions, including
> LIKE, ILIKE, RLIKE, SIMILAR, posix regex, REGEXP_REPLACE (MySQL, Oracle),
> REGEXP_CONTAINS (BigQuery), other BigQuery REGEXP_ functions, PARSE_URL,
> JSON_REPLACE, PARSE_TIMESTAMP.
> It can also be used for functions that have mutable state, e.g. {{RANDOM}}
> with and without a seed.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)