[jira] [Commented] (IMPALA-12374) Explore optimizing re2 usage for leading / trailing ".*" when generating LIKE regex

Joe McDonnell (Jira) Fri, 30 Jan 2026 15:58:05 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18055571#comment-18055571
 ]


Joe McDonnell commented on IMPALA-12374:
----------------------------------------

The performance difference is somewhat subtle and depends on the actual strings 
that we are matching. Probably the best way to see this is by customizing the 
be/src/benchmarks/expr-benchmark.cc benchmark to drop the irrelevant cases and 
add a bunch of specific cases for like vs regexp_like. To run the benchmark, 
you want to have a release build and follow the steps to reduce variance here: 
[https://github.com/google/benchmark/blob/main/docs/reducing_variance.md]. So, 
on my machine, I did:
 # Disabled CPU frequency scaling
 # Disabled turbo by setting /sys/devices/system/cpu/intel_pstate/no_turbo to 1 
(this could vary by CPU type) - This slows the system down noticeably, so undo 
this when done.
 # Run with setarch to turn off ASLR
 # Run with taskset to pin it to a CPU

So, my final command to run it is:
{noformat}
bin/run-jvm-binary.sh taskset -c 0 setarch `uname -m` -R 
be/build/latest/benchmarks/expr-benchmark{noformat}
I'm getting fairly reproducible results with these settings.

The cases that show a noticeable performance improvement involve larger strings 
where there are many characters before/after the match. For example:
{noformat}
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaxyz'
 like '%x%z'
vs
regexp_like('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaxyz',
 'x.*z$', 'cn'){noformat}
(The 'cn' argument to regexp_like is telling it to be case sensitive and have . 
match newline, which is what we use when we construct the regexp for like. See 
[https://impala.apache.org/docs/build/plain-html/topics/impala_string_functions.html#string_functions__regexp_like]
 )
{noformat}
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaamnoaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
 like '%m%o%'
vs
regexp_like('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaamnoaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
 'm.*o', 'cn'){noformat}
Certain specific cases improve by >2x and could improve by 10x for large enough 
strings, so this is probably worth pursuing. This probably won't show up on 
benchmarks apart from TPC-H Q13 (and may be fairly subtle there).

> Explore optimizing re2 usage for leading / trailing ".*" when generating LIKE 
> regex
> -----------------------------------------------------------------------------------
>
>                 Key: IMPALA-12374
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12374
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 4.3.0
>            Reporter: Joe McDonnell
>            Priority: Major
>              Labels: ramp-up
>
> Abseil has some recommendations about efficiently using re2 here: 
> [https://abseil.io/fast/21]
> One recommendation it has is to avoid leading / trailing .* for FullMatch():
> {noformat}
> Using RE2::FullMatch() with leading or trailing .* is an antipattern. 
> Instead, change it to RE2::PartialMatch() and remove the .*. 
> RE2::PartialMatch() performs an unanchored search, so it is also necessary to 
> anchor the regular expression (i.e. with ^ or $) to indicate that it must 
> match at the start or end of the string.{noformat}
> For our slow path LIKE evaluation, we convert the LIKE to a regular 
> expression and use FullMatch(). Our code to generate the regular expression 
> will use leading/trailing .* and FullMatch for patterns like '%a%b%'. We 
> could try detecting these cases and switching to PartialMatch with anchors. 
> See the link for more details about how this works.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (IMPALA-12374) Explore optimizing re2 usage for leading / trailing ".*" when generating LIKE regex

Reply via email to