[
https://issues.apache.org/jira/browse/IMPALA-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18055571#comment-18055571
]
Joe McDonnell commented on IMPALA-12374:
----------------------------------------
The performance difference is somewhat subtle and depends on the actual strings
that we are matching. Probably the best way to see this is by customizing the
be/src/benchmarks/expr-benchmark.cc benchmark to drop the irrelevant cases and
add a bunch of specific cases for like vs regexp_like. To run the benchmark,
you want to have a release build and follow the steps to reduce variance here:
[https://github.com/google/benchmark/blob/main/docs/reducing_variance.md]. So,
on my machine, I did:
# Disabled CPU frequency scaling
# Disabled turbo by setting /sys/devices/system/cpu/intel_pstate/no_turbo to 1
(this could vary by CPU type) - This slows the system down noticeably, so undo
this when done.
# Run with setarch to turn off ASLR
# Run with taskset to pin it to a CPU
So, my final command to run it is:
{noformat}
bin/run-jvm-binary.sh taskset -c 0 setarch `uname -m` -R
be/build/latest/benchmarks/expr-benchmark{noformat}
I'm getting fairly reproducible results with these settings.
The cases that show a noticeable performance improvement involve larger strings
where there are many characters before/after the match. For example:
{noformat}
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaxyz'
like '%x%z'
vs
regexp_like('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaxyz',
'x.*z$', 'cn'){noformat}
(The 'cn' argument to regexp_like is telling it to be case sensitive and have .
match newline, which is what we use when we construct the regexp for like. See
[https://impala.apache.org/docs/build/plain-html/topics/impala_string_functions.html#string_functions__regexp_like]
)
{noformat}
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaamnoaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
like '%m%o%'
vs
regexp_like('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaamnoaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
'm.*o', 'cn'){noformat}
Certain specific cases improve by >2x and could improve by 10x for large enough
strings, so this is probably worth pursuing. This probably won't show up on
benchmarks apart from TPC-H Q13 (and may be fairly subtle there).
> Explore optimizing re2 usage for leading / trailing ".*" when generating LIKE
> regex
> -----------------------------------------------------------------------------------
>
> Key: IMPALA-12374
> URL: https://issues.apache.org/jira/browse/IMPALA-12374
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 4.3.0
> Reporter: Joe McDonnell
> Priority: Major
> Labels: ramp-up
>
> Abseil has some recommendations about efficiently using re2 here:
> [https://abseil.io/fast/21]
> One recommendation it has is to avoid leading / trailing .* for FullMatch():
> {noformat}
> Using RE2::FullMatch() with leading or trailing .* is an antipattern.
> Instead, change it to RE2::PartialMatch() and remove the .*.
> RE2::PartialMatch() performs an unanchored search, so it is also necessary to
> anchor the regular expression (i.e. with ^ or $) to indicate that it must
> match at the start or end of the string.{noformat}
> For our slow path LIKE evaluation, we convert the LIKE to a regular
> expression and use FullMatch(). Our code to generate the regular expression
> will use leading/trailing .* and FullMatch for patterns like '%a%b%'. We
> could try detecting these cases and switching to PartialMatch with anchors.
> See the link for more details about how this works.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]