bziobrowski opened a new pull request, #14833:
URL: https://github.com/apache/pinot/pull/14833

   This PR optimizes a number of string scalar functions, including:
   - ltrim
   - rtrim
   - unique_ngrams
   - concat
   - concat_ws
   It also adds new versions of other functions, optimized under the
   assumption that the pattern argument is constant:
   - regexp_replace_const
   - regexp_like_const
   - regexp_extract_const
   - replace_const
   - like_const
   
   All of the functions mentioned above now initialize their temporary objects
   once and clear/reuse them on each call.
   As the following benchmark output shows, this change can speed up a raw
   function call by up to roughly 4x (e.g. 101.5 us/op down to 25.7 us/op for
   the q.[aeiou]c.* pattern).
   
   ```
   Benchmark                                          (_regex)  Mode  Cnt    Score    Error  Units
   BenchmarkRegexpReplace.testRegexpReplaceConst  q.[aeiou]c.*  avgt    3   25.720 ±  0.262  us/op
   BenchmarkRegexpReplace.testRegexpReplaceConst           .*a  avgt    3   92.530 ±  2.315  us/op
   BenchmarkRegexpReplace.testRegexpReplaceConst           b.*  avgt    3   34.444 ±  3.076  us/op
   BenchmarkRegexpReplace.testRegexpReplaceConst            .*  avgt    3   42.251 ±  1.791  us/op
   BenchmarkRegexpReplace.testRegexpReplaceConst        .*ated  avgt    3  121.553 ±  1.767  us/op
   BenchmarkRegexpReplace.testRegexpReplaceConst        .*ba.*  avgt    3  130.567 ±  1.258  us/op
   BenchmarkRegexpReplace.testRegexpReplaceOld    q.[aeiou]c.*  avgt    3  101.532 ± 10.586  us/op
   BenchmarkRegexpReplace.testRegexpReplaceOld             .*a  avgt    3  153.493 ±  8.621  us/op
   BenchmarkRegexpReplace.testRegexpReplaceOld             b.*  avgt    3   75.913 ±  2.909  us/op
   BenchmarkRegexpReplace.testRegexpReplaceOld              .*  avgt    3   75.989 ±  4.248  us/op
   BenchmarkRegexpReplace.testRegexpReplaceOld          .*ated  avgt    3  214.719 ± 91.627  us/op
   BenchmarkRegexpReplace.testRegexpReplaceOld          .*ba.*  avgt    3  212.798 ±  5.929  us/op
   ```
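
   To illustrate the reuse idea described above, here is a minimal sketch. It
   is not the code from this PR: the class name ConstRegexpReplace, its method
   signature, and its argument order are hypothetical. It assumes the regex
   argument never changes between calls and that the instance is used by a
   single thread.

   ```java
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;

   // Hypothetical illustration only; not the implementation added by this PR.
   final class ConstRegexpReplace {
     // Created on the first call, then reused. Not thread-safe.
     private Matcher _matcher;

     String replace(String input, String regex, String replacement) {
       if (_matcher == null) {
         // Assumption: regex is constant, so it is compiled exactly once.
         _matcher = Pattern.compile(regex).matcher(input);
       } else {
         // Reuse the existing Matcher instead of compiling and allocating again.
         _matcher.reset(input);
       }
       return _matcher.replaceAll(replacement);
     }
   }
   ```

   The saving comes from skipping Pattern.compile and the Matcher allocation on
   every call after the first; the replacement itself still runs per row.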
   
   If query processing is dominated by the function call, the effect on actual
   query performance is similar:
   ```
   Benchmark                   (_numRows)  (_query)  (_scenario)  Mode  Cnt   Score   Error  Units
   BenchmarkQueriesMSQE.query     1500000  select * from
   (
     select RAW_STRING_COL
     from MyTable
     limit 100000
   )
   where regexp_like_const('.*a.*', RAW_STRING_COL )   EXP(0.001)  avgt    5  12.351 ± 1.591  ms/op

   BenchmarkQueriesMSQE.query     1500000  select * from
   (
     select RAW_STRING_COL
     from MyTable
     limit 100000
   )
   where regexp_like('.*a.*', RAW_STRING_COL )   EXP(0.001)  avgt    5  42.298 ± 3.225  ms/op
   
   ```
   
   NOTE: I added the _const functions because the engine currently has no way
   to choose an implementation based on whether a function argument is constant
   or variable. If we changed an existing function, e.g. regexp_replace, in the
   same way, it would start returning wrong results whenever the regular
   expression is variable, without raising an error or warning.
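
   The hazard in the note above can be made concrete with a small sketch. The
   class CachedRegexpLike below is hypothetical and only mimics the "compile
   on first call" behavior; it is not the PR's code.

   ```java
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;

   // Hypothetical illustration only; names and signature are assumptions.
   final class CachedRegexpLike {
     private Matcher _matcher;

     // Returns true if the cached pattern is found anywhere in input.
     boolean like(String input, String regex) {
       if (_matcher == null) {
         _matcher = Pattern.compile(regex).matcher(input);
       } else {
         // From the second call onward the regex argument is ignored.
         _matcher.reset(input);
       }
       return _matcher.find();
     }
   }
   ```

   Calling like("abc", "a.*") and then like("xyz", "x.*") on the same instance
   returns false for the second call, because the input is still matched
   against the first pattern, while an uncached evaluation of "x.*" against
   "xyz" returns true. No exception is thrown, which is why the cached variants
   are exposed under separate _const names.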
   
   
   

