[ 
https://issues.apache.org/jira/browse/CALCITE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753199#comment-17753199
 ] 

Thomas Rebele commented on CALCITE-5914:
----------------------------------------

I've a ticket with a similar goal in my backlog for a product using Calcite. 
Maybe the new mechanism for this ticket could be made general enough to include 
the following use case:

{{{}MY_FUNC(param1, param2, param3){}}}, where {{param2}} and {{param3}} are 
constants (either literals or dynamic parameters, which are known at the 
beginning of the query execution). The function creates an instance of a 
converter from {{param2}} and {{{}param3{}}}, which is then applied to 
{{{}param1{}}}. The converter would be the same for all calls (similar to 
{{{}RLIKE(ename, 'A.*'){}}}).

Ideally, the dynamic code would create the converter at the beginning the first 
call and reuse it for all subsequent calls. I guess the {{FunctionState}} would 
be passed as an extra parameter, e.g., {{{}MY_FUNC(param1, param2, param3, 
state){}}}. For the initialization I see two possibilities:
 * the implementations of {{FunctionState}} need to provide a no-arg 
constructor so that they can be created by {{MyFuncState state = new 
MyFuncState();}} the dynamic code of the {{RexCall}} is responsible for the 
correct initialization
 * adding a {{createInitialFunctionState(...)}} somewhere, that would be called 
by the dynamic code with the arguments of the first call. Though that does not 
help for non-constant arguments as in {{RLIKE(ename, job || '.*')}} .

It could be useful to split {{MY_FUNC(param1, param2, param3)}} into two 
functions (pseudo-code):
{code:java}
state = state == null ? new MyFuncState() : state;
precomputed = MY_FUNC_1(param2, param3, state)
result = MY_FUNC_2(param1, precomputed)
{code}
Maybe there should be the possibility to pass a parameter to both, i.e., 
{{MY_FUNC_1(param2, param3, state)}} and {{{}MY_FUNC_2(param1, param2, 
precomputed){}}}.

The alternative approach could be beneficial for the regular expression 
caching. A function {{REGEXP_COMPILE}} (i.e., {{{}pattern = 
REGEXP_COMPILE(regexp, regexpCache){}}}) could be reused for all calls that use 
regular expressions.

> Cache compiled regular expressions in SQL function runtime
> ----------------------------------------------------------
>
>                 Key: CALCITE-5914
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5914
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Priority: Major
>
> Cache compiled regular expressions (and other amortized work) in SQL function 
> runtime. Compiling a regular expression to a pattern is expensive (compared 
> to the cost of matching, given an existing pattern) and therefore caching the 
> compiled form will yield performance benefits if the regular expression is 
> constant or has a small number of values.
> Consider the following query:
> {code:java}
> SELECT ename, job, RLIKE(ename, 'A.*'), RLIKE(ename, job || '.*')
> FROM emp
> {code}
> The first regular expression, '{{A.\*}}', is constant and can be compiled at 
> prepare time or at the start of execution; the second regular expression, 
> {{job || '.\*'}}, might vary from one row to the next. However if the {{job}} 
> column has a small number of values it still might be beneficial to cache the 
> compiled regular expression.
> If {{SqlFunctions.rlike}} could use a cache (mapping from {{String}} to 
> {{{}java.util.regex.Pattern{}}}) then it would achieve benefits in both the 
> constant and non-constant cases.
> The cache needs to:
>  * be thread-safe (in case queries are executing using multiple threads),
>  * return thread-safe objects (as is {{{}Pattern{}}}),
>  * have bounded space (so that a query doesn't blow memory with 1 million 
> distinct regular expressions),
>  * disposed after the query has terminated,
>  * (ideally) share with regexes of the same language in the same query,
>  * not conflict with regexes of different languages in the same query.
> One possible implementation is to add an {{interface FunctionState}}, with 
> subclasses including {{class RegexpCache}}, and if argument 1 of a function 
> is a subclass of {{FunctionState}} the compiler would initialize the state in 
> the generated code. The function can rely on the state argument being 
> initialized, and being the same object from one call to the next. Example:
> {code:java}
> interface FunctionState {
> }
> class RegexpCache implements FunctionState {
>   final Cache<String, Pattern> cache = ...;
> }
> {code}
> This change should install the cache for all applicable functions, including 
> LIKE, ILIKE, RLIKE, SIMILAR, posix regex, REGEXP_REPLACE (MySQL, Oracle), 
> REGEXP_CONTAINS (BigQuery), other BigQuery REGEXP_ functions, PARSE_URL, 
> JSON_REPLACE, PARSE_TIMESTAMP.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to