Joe McDonnell created IMPALA-12374:
--------------------------------------
Summary: Explore optimizing re2 usage for leading / trailing ".*"
Key: IMPALA-12374
URL: https://issues.apache.org/jira/browse/IMPALA-12374
Project: IMPALA
Issue Type: Improvement
Components: Backend
Affects Versions: Impala 4.3.0
Reporter: Joe McDonnell
Abseil has some recommendations about efficiently using re2 here:
[https://abseil.io/fast/21]
One of its recommendations is to avoid a leading or trailing .* with FullMatch():
{noformat}
Using RE2::FullMatch() with leading or trailing .* is an antipattern. Instead,
change it to RE2::PartialMatch() and remove the .*. RE2::PartialMatch()
performs an unanchored search, so it is also necessary to anchor the regular
expression (i.e. with ^ or $) to indicate that it must match at the start or
end of the string.
{noformat}
For our slow-path LIKE evaluation, we convert the LIKE pattern to a regular
expression and evaluate it with FullMatch(). The generated regular expression
has a leading and/or trailing .* whenever the LIKE pattern starts or ends with
'%' (for example, '%a%b%' becomes something like .*a.*b.*). We could try
detecting these cases, dropping the leading/trailing .*, and switching to
PartialMatch(), adding a ^ or $ anchor on any side that did not have a .*.
See the link for more details about how this works.
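
As a rough illustration of the detection step, here is a minimal sketch of the pattern rewrite. It is not Impala's actual code: ToPartialMatchPattern() and EndsWithUnescapedDotStar() are hypothetical names. It assumes the input regex was generated from a LIKE pattern (literal metacharacters already escaped, no top-level alternation) and that '.' matches every character including newline (RE2's dot_nl option). Without that option, dropping a .* could change the results.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `re` ends with ".*" and the '.' is not escaped.
// For example, R"(a\.*)" ends with an escaped dot followed by '*', so it
// must not be treated as a trailing wildcard.
bool EndsWithUnescapedDotStar(const std::string& re) {
  if (re.size() < 2 || re.compare(re.size() - 2, 2, ".*") != 0) return false;
  // Count the backslashes immediately before the '.'. An odd count means the
  // dot itself is escaped.
  std::size_t backslashes = 0;
  for (std::size_t i = re.size() - 2; i > 0 && re[i - 1] == '\\'; --i) {
    ++backslashes;
  }
  return backslashes % 2 == 0;
}

// Hypothetical helper: rewrites a regex meant for RE2::FullMatch() into one
// meant for RE2::PartialMatch(). A leading/trailing ".*" is dropped; a side
// without ".*" gets an explicit "^" or "$" anchor instead.
// Assumption: the input is a LIKE-derived regex with no top-level '|'.
std::string ToPartialMatchPattern(const std::string& full_match_re) {
  std::string re = full_match_re;

  const bool strip_front = re.compare(0, 2, ".*") == 0;
  if (strip_front) re.erase(0, 2);

  const bool strip_back = EndsWithUnescapedDotStar(re);
  if (strip_back) re.erase(re.size() - 2);

  std::string out;
  if (!strip_front) out += '^';
  out += re;
  if (!strip_back) out += '$';
  return out;
}
```

With this sketch, the regex for '%a%b%' (.*a.*b.*) would become a.*b with no anchors, while the regex for 'a%b%' (a.*b.*) would become ^a.*b. The caller would then pass the result to PartialMatch() instead of FullMatch().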