[
https://issues.apache.org/jira/browse/CALCITE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753199#comment-17753199
]
Thomas Rebele commented on CALCITE-5914:
----------------------------------------
I've a ticket with a similar goal in my backlog for a product using Calcite.
Maybe the new mechanism for this ticket could be made general enough to include
the following use case:
{{{}MY_FUNC(param1, param2, param3){}}}, where {{param2}} and {{param3}} are
constants (either literals or dynamic parameters, which are known at the
beginning of the query execution). The function creates an instance of a
converter from {{param2}} and {{{}param3{}}}, which is then applied to
{{{}param1{}}}. The converter would be the same for all calls (similar to
{{{}RLIKE(ename, 'A.*'){}}}).
Ideally, the dynamic code would create the converter at the beginning the first
call and reuse it for all subsequent calls. I guess the {{FunctionState}} would
be passed as an extra parameter, e.g., {{{}MY_FUNC(param1, param2, param3,
state){}}}. For the initialization I see two possibilities:
* the implementations of {{FunctionState}} need to provide a no-arg
constructor so that they can be created by {{MyFuncState state = new
MyFuncState();}} the dynamic code of the {{RexCall}} is responsible for the
correct initialization
* adding a {{createInitialFunctionState(...)}} somewhere, that would be called
by the dynamic code with the arguments of the first call. Though that does not
help for non-constant arguments as in {{RLIKE(ename, job || '.*')}} .
It could be useful to split {{MY_FUNC(param1, param2, param3)}} into two
functions (pseudo-code):
{code:java}
state = state == null ? new MyFuncState() : state;
precomputed = MY_FUNC_1(param2, param3, state)
result = MY_FUNC_2(param1, precomputed)
{code}
Maybe there should be the possibility to pass a parameter to both, i.e.,
{{MY_FUNC_1(param2, param3, state)}} and {{{}MY_FUNC_2(param1, param2,
precomputed){}}}.
The alternative approach could be beneficial for the regular expression
caching. A function {{REGEXP_COMPILE}} (i.e., {{{}pattern =
REGEXP_COMPILE(regexp, regexpCache){}}}) could be reused for all calls that use
regular expressions.
> Cache compiled regular expressions in SQL function runtime
> ----------------------------------------------------------
>
> Key: CALCITE-5914
> URL: https://issues.apache.org/jira/browse/CALCITE-5914
> Project: Calcite
> Issue Type: Bug
> Reporter: Julian Hyde
> Priority: Major
>
> Cache compiled regular expressions (and other amortized work) in SQL function
> runtime. Compiling a regular expression to a pattern is expensive (compared
> to the cost of matching, given an existing pattern) and therefore caching the
> compiled form will yield performance benefits if the regular expression is
> constant or has a small number of values.
> Consider the following query:
> {code:java}
> SELECT ename, job, RLIKE(ename, 'A.*'), RLIKE(ename, job || '.*')
> FROM emp
> {code}
> The first regular expression, '{{A.\*}}', is constant and can be compiled at
> prepare time or at the start of execution; the second regular expression,
> {{job || '.\*'}}, might vary from one row to the next. However if the {{job}}
> column has a small number of values it still might be beneficial to cache the
> compiled regular expression.
> If {{SqlFunctions.rlike}} could use a cache (mapping from {{String}} to
> {{{}java.util.regex.Pattern{}}}) then it would achieve benefits in both the
> constant and non-constant cases.
> The cache needs to:
> * be thread-safe (in case queries are executing using multiple threads),
> * return thread-safe objects (as is {{{}Pattern{}}}),
> * have bounded space (so that a query doesn't blow memory with 1 million
> distinct regular expressions),
> * disposed after the query has terminated,
> * (ideally) share with regexes of the same language in the same query,
> * not conflict with regexes of different languages in the same query.
> One possible implementation is to add an {{interface FunctionState}}, with
> subclasses including {{class RegexpCache}}, and if argument 1 of a function
> is a subclass of {{FunctionState}} the compiler would initialize the state in
> the generated code. The function can rely on the state argument being
> initialized, and being the same object from one call to the next. Example:
> {code:java}
> interface FunctionState {
> }
> class RegexpCache implements FunctionState {
> final Cache<String, Pattern> cache = ...;
> }
> {code}
> This change should install the cache for all applicable functions, including
> LIKE, ILIKE, RLIKE, SIMILAR, posix regex, REGEXP_REPLACE (MySQL, Oracle),
> REGEXP_CONTAINS (BigQuery), other BigQuery REGEXP_ functions, PARSE_URL,
> JSON_REPLACE, PARSE_TIMESTAMP.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)